Similarity query support in big data management systems
Similarity query support in big data management systems (2020)
Similarity query processing is becoming increasingly important in many applications such as data cleaning, record linkage, Web search, and document analytics. In this paper we study how to provide end-to-end similarity query support natively in a parallel database system. We discuss how to express a similarity predicate in its query language, how to build indexes, how to answer similarity queries (selections and joins) efficiently in the runtime engine, possibly using indexes, and how to optimize similarity queries. One particular challenge is how to incorporate existing similarity join algorithms, which often require a series of steps to achieve a high efficiency, including collecting token frequencies, finding matching record id pairs, and reassembling result records based on id pairs. We present a novel approach that uses existing runtime operators to implement such complex join algorithms without reinventing the wheel; doing so positions the system to automatically benefit from future improvements to those operators. The approach includes a technique to transform a similarity join plan into an efficient operator-based physical plan during query optimization by using a template expressed largely in the system’s user-level query language; this technique greatly simplifies the specification of such a transformation rule. We use Apache AsterixDB, a parallel Big Data management system, to illustrate and validate our techniques. We conduct an experimental study using several large, real datasets on a parallel computing cluster to assess the similarity query support. We also include experiments involving three other parallel systems and report the efficacy and performance results.
Keywords: Similarity query | Parallel database | Optimization
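The three similarity-join steps named in the abstract above (collecting token frequencies, finding matching record id pairs, and reassembling result records from id pairs) can be illustrated with a minimal single-machine sketch. It uses Jaccard similarity with prefix filtering, a common technique for set-similarity joins. The function names, the whitespace tokenizer, and the threshold are assumptions made for illustration; this is not AsterixDB's operator set or query syntax.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase, whitespace-separated word set.
    return set(text.lower().split())


def jaccard(a, b):
    # Jaccard similarity of two token sets.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(records, threshold):
    """Self-join on Jaccard similarity; records maps id -> text."""
    token_sets = {rid: tokenize(text) for rid, text in records.items()}

    # Step 1: collect token frequencies and fix a global token order
    # (rarest tokens first, ties broken alphabetically).
    freq = Counter(tok for toks in token_sets.values() for tok in toks)
    ordered = {
        rid: sorted(toks, key=lambda t: (freq[t], t))
        for rid, toks in token_sets.items()
    }

    # Step 2: find candidate id pairs via prefix filtering, then verify them.
    index = defaultdict(list)  # token -> ids whose prefix contains the token
    id_pairs = []
    for rid in sorted(ordered):
        toks = ordered[rid]
        prefix_len = len(toks) - math.ceil(threshold * len(toks)) + 1
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)
        for other in sorted(candidates):
            sim = jaccard(token_sets[rid], token_sets[other])
            if sim >= threshold:
                id_pairs.append((other, rid, sim))

    # Step 3: reassemble full result records from the matching id pairs.
    return [(records[a], records[b], sim) for a, b, sim in id_pairs]
```

Working on id pairs in step 2 and fetching the full records only in step 3 keeps the intermediate data small, which is the motivation for the multi-step structure the abstract describes.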
An extensive study on the evolution of context-aware personalized travel recommender systems
An extensive study on the evolution of context-aware personalized travel recommender systems (2020)
Since the beginning of civilization, travel for various reasons has been an essential part of human life, and so have travel recommendations, though the earliest recommendations were simply the accrued experiences shared by the community. Modern recommender systems evolved along with the growth of information technology and now contribute to all industry and service segments, including travel and tourism. The journey started with generic recommender engines, which gave way to personalized recommender systems and then advanced to contextualized personalization with the advent of artificial intelligence. The current era is also witnessing a boom in social media usage, and social media big data is acting as a critical input for various analytics, recommender systems included. This paper details a study of the evolution of travel recommender systems, their features, and their current limitations. We also discuss the key algorithms used for classification and recommendation and the metrics that can be used to evaluate the performance of the algorithms and thereby the recommenders.
Keywords: Recommender system | Personalization | Context aware | Big data | Travel and tourism
Towards a real-time processing framework based on improved distributed recurrent neural network variants with fastText for social big data analytics
Towards a real-time processing framework based on improved distributed recurrent neural network variants with fastText for social big data analytics (2020)
Big data generated by social media is a valuable source of information and offers an excellent opportunity to mine insights. In particular, user-generated content such as reviews, recommendations, and users’ behavior data is useful for supporting the marketing activities of many companies. Knowing what users say in social media reviews about the products they bought or the services they used is a key factor in decision making. Sentiment analysis is one of the fundamental tasks in Natural Language Processing. Although deep learning for sentiment analysis has achieved great success and has allowed several firms to analyze and extract relevant information from their textual data, a model that runs in a traditional environment cannot remain effective as the volume of data grows, which underlines the importance of efficient distributed deep learning models for social Big Data analytics. Besides, social media analysis is known to be a complex process that involves a set of complex tasks. Therefore, it is important to address the challenges and issues of social big data analytics and to improve the classification accuracy of deep learning techniques in order to support better decisions. In this paper, we propose an approach for sentiment analysis that adopts fastText with Recurrent Neural Network (RNN) variants to represent textual data efficiently. It then employs the new representations to perform the classification task. Its main objective is to enhance the classification accuracy of well-known RNN variants and to handle large-scale data. In addition, we propose a distributed intelligent system for real-time social big data analytics. It is designed to ingest, store, process, index, and visualize huge amounts of information in real time.
The proposed system adopts distributed machine learning with our proposed method for enhancing decision-making processes. Extensive experiments conducted on two benchmark data sets demonstrate that our proposal for sentiment analysis outperforms well-known distributed recurrent neural network variants (i.e., Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), and Gated Recurrent Unit (GRU)). Specifically, we tested the efficiency of our approach using the three different deep learning models. The results show that our proposed approach is able to enhance the performance of the three models. The current work can provide several benefits for researchers and practitioners who want to collect, handle, analyze and visualize several sources of information in real-time. Also, it can contribute to a better understanding of public opinion and user behaviors using our proposed system with the improved variants of the most powerful distributed deep learning and machine learning algorithms. Furthermore, it is able to increase the classification accuracy of several existing works based on RNN models for sentiment analysis.
Keywords: Big data | FastText | Recurrent neural networks | LSTM | BiLSTM | GRU | Natural language processing | Sentiment analysis | Social big data analytics
Dynamic occupant density models of commercial buildings for urban energy simulation
Dynamic occupant density models of commercial buildings for urban energy simulation (2020)
The number of occupants and its changing pattern over time are key information for building and urban energy simulation. However, the commonly used simplifying assumption of a fixed occupancy schedule does not reflect the complicated reality, leading to significant errors in energy simulation. Therefore, dynamic occupant density models that describe the real-world situation more accurately should be developed. This paper presents a methodology to develop such a model for commercial buildings and expand it from the building level to the urban level. First, a total of 2275 commercial buildings in Nanjing, a major city in China, are identified and classified into three sub-categories using Points of Interest and logistic regression. Then field measurement is conducted to obtain the hourly occupant density for 12 sample commercial buildings. The building-level dynamic occupant density model is developed by fitting normal distribution functions to the measured data. Finally, transportation accessibility and population level, two urban parameters, are defined and used to expand the building-level occupant density model to the urban-level one. The dynamic urban-level occupant density model is verified for all three sub-categories of commercial buildings and the overall results are acceptable.
Keywords: Big data | Commercial buildings | Urban-level | Dynamic occupant density models
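The building-level step in the abstract above, fitting normal distribution functions to measured hourly occupant density, can be sketched as follows. This is a simplified illustration that fits a single scaled normal curve by the method of moments. The function names are hypothetical, and the paper may use several peaks and a different fitting procedure.

```python
import math


def fit_normal_profile(hours, density):
    """Moment-based fit of one scaled normal curve to hourly density values."""
    total = sum(density)
    mean = sum(h * d for h, d in zip(hours, density)) / total
    var = sum(d * (h - mean) ** 2 for h, d in zip(hours, density)) / total
    # "scale" makes the area under the curve match the summed hourly density,
    # assuming one-hour measurement bins.
    return {"scale": total, "mean": mean, "std": math.sqrt(var)}


def predict_density(params, hour):
    """Evaluate the fitted curve at a given hour of the day."""
    z = (hour - params["mean"]) / params["std"]
    pdf = math.exp(-0.5 * z * z) / (params["std"] * math.sqrt(2.0 * math.pi))
    return params["scale"] * pdf
```

Given a list of hours and the matching measured densities, `fit_normal_profile` returns the peak hour (`mean`) and the spread of the occupancy curve (`std`), and `predict_density` then gives a smooth hourly schedule that can replace a fixed one.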
Impact factors of the real-world fuel consumption rate of light duty vehicles in China
Factors affecting the real-world fuel consumption rate of light duty vehicles in China (2020)
Measuring the real-world fuel consumption of light duty vehicles can be challenging because little actual data is collected. In this paper, we use big data retrieved from records of real-world fuel consumption for different brands of vehicles in different areas of China (n = 106,809 samples from 201 brands of vehicles and 34 cities) to build a real-world fuel consumption rate (RFCR) model that estimates fuel consumption given the driving conditions and identifies the main factors that affect actual fuel consumption. We find that the average deviation between actual fuel consumption and the fitted results of the RFCR model is 4.22%, which does not significantly differ from zero, and that the fuel consumption calculated by the RFCR model tends to be 1.40 L/100 km (about 25%) higher than the officially reported data. Furthermore, we find that annual average temperature and altitude significantly influence the fuel consumption rate. The results indicate that there is a discrepancy between the theoretical fuel consumption released by authorities and fuel consumption in the real world, and that some green behaviors (choosing light duty vehicles, reducing the use of air conditioning, and changing to a manual transmission) can reduce the energy consumption of vehicles.
Keywords: Real-world fuel consumption rate | Energy consumption | Private passenger vehicles | Big data | China
Analysis of substance use and its outcomes by machine learning I: Childhood evaluation of liability to substance use disorder
Analysis of substance use and its outcomes by machine learning I: Childhood evaluation of liability to substance use disorder (2020)
Background: Substance use disorder (SUD) exacts enormous societal costs in the United States, and it is important to detect high-risk youths for prevention. Machine learning (ML) is a method for finding patterns in data and making predictions from them. We hypothesized that ML identifies the health, psychological, psychiatric, and contextual features that predict SUD, and that the identified features predict which high-risk individuals develop SUD. Method: Male (N=494) and female (N=206) participants and their informant parents were administered a battery of questionnaires across five waves of assessment conducted at 10–12, 12–14, 16, 19, and 22 years of age. Characteristics most strongly associated with SUD were identified using the random forest (RF) algorithm from approximately 1000 variables measured at each assessment. Next, the complement of features was validated, and the best models were selected for predicting SUD using seven ML algorithms. Lastly, the area under the receiver operating characteristic curve (AUROC) was used to evaluate the accuracy of detecting individuals who develop SUD+/- up to thirty years of age. Results: Approximately thirty variables strongly predict SUD. The predictors shift from psychological dysregulation and poor health behavior in late childhood to non-normative socialization in mid to late adolescence. In 10–12-year-old youths, the features predict SUD+/- with 74% accuracy, increasing to 86% at 22 years of age. The RF algorithm optimally detects individuals between 10 and 22 years of age who develop SUD compared to other ML algorithms. Conclusion: These findings inform the items required for inclusion in instruments to accurately identify high-risk youths and young adults requiring SUD prevention.
Keywords: Substance use disorder | Machine learning | Substance abuse prevention | Big data | Screening addiction risk
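The AUROC metric used in the abstract above has a simple interpretation: it is the probability that a randomly chosen positive case (here, an individual who develops SUD) receives a higher predicted score than a randomly chosen negative case, with ties counted as half. A minimal sketch of that definition follows. The function name is an assumption for illustration, and this is not the study's evaluation code.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison (Mann-Whitney U); ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # tie contributes half a win
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of cases, which is fine for illustration but would normally be replaced by a rank-based computation on large datasets.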
A novel intelligent option price forecasting and trading system by multiple kernel adaptive filters
A novel intelligent option price forecasting and trading system by multiple kernel adaptive filters (2020)
Derivatives such as options are complex financial instruments. The risk in option trading creates demand for trading support systems that help investors control and hedge their risk. The nonlinearity and non-stationarity of option dynamics are the main challenges of option price forecasting. To address the problem, this study develops a multi-kernel adaptive filter (MKAF) for online option trading. MKAF is an improved version of the adaptive filter, which employs multiple kernels to enhance the richness of nonlinear feature representation. The MKAF is a fully adaptive online algorithm. The strength of MKAF is that the weights of the kernels are determined simultaneously and optimally in the filter coefficient updates, so the weights do not need to be designed separately. Therefore, MKAF is good at tracking nonstationary nonlinear option dynamics. Moreover, to reduce the computation time in updating the filter and to prevent overadaptation, the number of kernels is restricted by coherence-based sparsification, which constructs a dictionary and uses a coherence threshold to restrict the dictionary size. This study compared the new method with traditional ones and found that the performance improvement is significant and robust. In particular, the cumulative trading profits are substantially increased.
Keywords: Artificial intelligence | Adaptive filter | Multiple Kernel Machine | Big data analysis | Data mining | Financial forecasting
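The two ideas in the abstract above, updating the weights of several kernels jointly and limiting the dictionary with a coherence threshold, can be illustrated with a small sketch. It is a generic multi-kernel normalised LMS filter with Gaussian kernels, written under stated assumptions: the class name, bandwidths, step size, and threshold are all hypothetical, and this is not the authors' MKAF implementation.

```python
import math


class MultiKernelLMS:
    """Illustrative multi-kernel LMS filter with a coherence-limited dictionary."""

    def __init__(self, bandwidths, step_size=0.2, coherence=0.9):
        self.bandwidths = bandwidths  # one Gaussian kernel per bandwidth
        self.step_size = step_size
        self.coherence = coherence    # admission threshold in (0, 1]
        self.dictionary = []          # stored input vectors
        self.weights = []             # weights[j][k]: dictionary entry j, kernel k

    def _kernels(self, x, center):
        # Gaussian kernel values between x and one dictionary entry.
        dist2 = sum((a - b) ** 2 for a, b in zip(x, center))
        return [math.exp(-dist2 / (2.0 * s * s)) for s in self.bandwidths]

    def predict(self, x):
        total = 0.0
        for center, w in zip(self.dictionary, self.weights):
            total += sum(wk * kv for wk, kv in zip(w, self._kernels(x, center)))
        return total

    def update(self, x, target):
        """One online step; returns the prediction error before the update."""
        error = target - self.predict(x)
        # Coherence check: admit x only if it is not too similar to any
        # stored entry under any of the kernels.
        max_coh = max(
            (max(self._kernels(x, c)) for c in self.dictionary), default=0.0
        )
        if max_coh <= self.coherence:
            self.dictionary.append(list(x))
            self.weights.append([0.0] * len(self.bandwidths))
        # Normalised LMS step: the weights of all kernels are adjusted jointly.
        feats = [self._kernels(x, c) for c in self.dictionary]
        norm = sum(kv * kv for f in feats for kv in f) + 1e-8
        for w, f in zip(self.weights, feats):
            for k, kv in enumerate(f):
                w[k] += self.step_size * error * kv / norm
        return error
```

In use, each new observation is passed to `update(x, target)`, and `predict(x)` gives the one-step forecast. A lower coherence threshold admits fewer dictionary entries, trading accuracy for a smaller and faster filter.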
Understanding the impact of business analytics on innovation
Understanding the impact of business analytics on innovation (2020)
Advances in Business Analytics in the era of Big Data have provided unprecedented opportunities for organizations to innovate. With insights gained from Business Analytics, companies are able to develop new or improved products/services. However, few studies have investigated the mechanism through which Business Analytics contributes to a firm’s innovation success. This research aims to address this gap by theoretically and empirically investigating the relationship between Business Analytics and innovation. To achieve this aim, absorptive capacity theory is used as a theoretical lens to inform the development of a research model. Absorptive capacity refers to a firm’s ability to recognize the value of new, external information, assimilate it and apply it to commercial ends. The research model covers the use of Business Analytics, environmental scanning, data-driven culture, innovation (new product newness and meaningfulness), and competitive advantage. The research model is tested through a questionnaire survey of 218 UK businesses. The results suggest that Business Analytics directly improves environmental scanning, which in turn helps to enhance a company’s innovation. Business Analytics also directly enhances data-driven culture, which in turn impacts environmental scanning. Data-driven culture plays another important role by moderating the effect of environmental scanning on new product meaningfulness. The findings demonstrate the positive impact of business analytics on innovation and the pivotal roles of environmental scanning and data-driven culture. Organizations wishing to realize the potential of Business Analytics thus need changes in both their external and internal focus.
Keywords: Analytics | Innovation | Big Data | Data-driven culture | Absorptive capacity
An analytic infrastructure for harvesting big data to enhance supply chain performance
An analytic infrastructure for harvesting big data to enhance supply chain performance (2020)
Big data has already received a tremendous amount of attention from managers in every industry, policy and decision makers in governments, and researchers in many different areas. However, current big data analytics have conspicuous limitations, especially when dealing with information silos. In this paper, we synthesise existing research on big data analytics and propose an integrated infrastructure for breaking down the information silos, in order to enhance supply chain performance. The analytic infrastructure effectively leverages rich big data sources (i.e. databases, social media, mobile and sensor data) and quantifies the related information using various big data analytics. The information generated can be used to identify a required competence set (which refers to a collection of skills and knowledge used for specific problem solving) and to provide roadmaps to firms and managers in generating actionable supply chain strategies, facilitating collaboration between departments, and generating fact-based operational decisions. We showcase the usefulness of the analytic infrastructure by conducting a case study in a world-leading company that produces sports equipment. The results indicate that it enabled managers: (a) to integrate information silos in big data analytics to serve as inputs for new product ideas; (b) to capture and interrelate different competence sets to provide an integrated perspective of the firm’s operations capabilities; and (c) to generate a visual decision path that facilitated decision making regarding how to expand competence sets to support new product development.
Keywords: Decision support systems | Big data | Analytic infrastructure | Competence set | Deduction graph
Attacking and defending multiple valuable secrets in a big data world
Attacking and defending multiple valuable secrets in a big data world (2020)
This paper studies the attack-and-defence game between a web user and a whole set of players over this user’s ‘valuable secrets.’ The number and type of these valuable secrets are the user’s private information. Attempts to tap information as well as privacy protection are costly. The multiplicity of secrets is of strategic value for the holders of these secrets. Users with few secrets keep their secrets private with some probability, even though they do not protect them. Users with many secrets protect their secrets at a cost that is smaller than the value of the secrets protected. The analysis also accounts for multiple redundant information channels with cost asymmetries, relating the analysis to attack-and-defence games with a weakest link.
Keywords: Big-data | Privacy | Conflict | Valuable secrets | Attack-and-defence