FPO tree and DP3 algorithm for distributed parallel Frequent Itemsets Mining (2020)
Frequent Itemsets Mining is a fundamental mining model in Data Mining. It supports a vast range of application fields and can be employed as a key computation phase in many other mining models such as Association Rules, Correlations, and Classifications. Many distributed parallel algorithms have been introduced to cope with the very large-scale datasets of Big Data. However, running time and memory scalability still lack adequate solutions for very large and “hard-to-mine” datasets. In this paper, we propose a distributed parallel algorithm named DP3 (Distributed PrePostPlus), which parallelizes the state-of-the-art PrePost+ algorithm and operates in a Master-Slaves model. Slave machines mine local frequent itemsets and send them, with their support counts, to the Master for aggregation. When tremendous numbers of itemsets are transferred between the Slaves and the Master, the computational load at the Master would be extremely heavy without the support of our complete FPO tree (Frequent Patterns Organization), which provides optimal compactness for light data transfers and highly efficient aggregation with pruning ability. The processing phases of the Slaves and Master are designed for memory scalability and shared-memory parallelism in the Work-Pool model so as to utilize the computational power of multi-core CPUs. We conducted experiments on both synthetic and real datasets, and the empirical results show that our algorithm far outperforms the well-known PFP algorithm and three other recent high-performance algorithms: Dist-Eclat, BigFIM, and MapFIM.
Keywords: Frequent Itemsets Mining | Parallel | Distributed | Data Mining | Big Data | Prefix tree
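The Master-side aggregation step described above can be illustrated with a deliberately simplified sketch: plain Python dictionaries stand in for the paper's FPO tree, and `aggregate_local_counts` and the toy data are hypothetical names, not taken from the paper.

```python
from collections import Counter

def aggregate_local_counts(local_results, min_support):
    """Merge per-Slave support counts and keep only globally frequent itemsets."""
    total = Counter()
    for counts in local_results:          # one Counter per Slave machine
        total.update(counts)
    return {itemset: c for itemset, c in total.items() if c >= min_support}

# Two Slaves report local support counts for the same itemsets.
slave1 = Counter({frozenset({"a"}): 4, frozenset({"a", "b"}): 2})
slave2 = Counter({frozenset({"a"}): 3, frozenset({"a", "b"}): 1})
frequent = aggregate_local_counts([slave1, slave2], min_support=5)
# {"a"} has global support 4 + 3 = 7 and survives; {"a","b"} (support 3) is pruned.
```

The FPO tree exists precisely because a flat dictionary like this transfers and merges poorly at scale; the sketch only shows what is being computed, not how the paper compacts it.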
Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach (2020)
With the advent of Big Data, databases containing large quantities of similar time series are now available in many applications. Forecasting time series in these domains with traditional univariate forecasting procedures leaves great potential for producing accurate forecasts untapped. Recurrent neural networks (RNNs), and in particular Long Short-Term Memory (LSTM) networks, have recently proven able to outperform state-of-the-art univariate time series forecasting methods in this context when trained across all available time series. However, if the time series database is heterogeneous, accuracy may degrade, so that on the way towards fully automatic forecasting methods in this space, a notion of similarity between the time series needs to be built into the methods. To this end, we present a prediction model that can be used with different types of RNN models on subgroups of similar time series, which are identified by time series clustering techniques. We assess our proposed methodology using LSTM networks, a widely popular RNN variant, together with various clustering algorithms, such as kMeans, DBScan, Partition Around Medoids (PAM), and Snob. Our method achieves competitive results on benchmarking datasets under competition evaluation procedures. In particular, in terms of mean sMAPE accuracy it consistently outperforms the baseline LSTM model, and it outperforms all other methods on the CIF2016 forecasting competition dataset.
Keywords: Big data forecasting | RNN | LSTM | Time series clustering | Neural networks
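The cluster-then-forecast idea can be sketched in a few lines. Here a toy two-cluster 1-D k-means on each series' mean level stands in for the feature-based clustering; in the actual methodology one LSTM would then be trained per cluster. All names and data are illustrative, not from the paper.

```python
def assign_clusters(series_list, iters=10):
    """Toy two-cluster 1-D k-means on each series' mean level; a real
    pipeline would cluster on richer features (kMeans, DBScan, PAM, Snob)
    and then train one RNN/LSTM per resulting subgroup."""
    feats = [sum(s) / len(s) for s in series_list]
    centers = [min(feats), max(feats)]          # naive initialization
    labels = [0] * len(feats)
    for _ in range(iters):
        labels = [0 if abs(f - centers[0]) <= abs(f - centers[1]) else 1
                  for f in feats]
        for j in (0, 1):
            members = [f for f, lab in zip(feats, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

series = [[1, 2, 3], [2, 2, 2], [100, 101, 99]]
labels = assign_clusters(series)   # the two low-level series share a cluster
```

The point of the grouping is that each per-cluster model is trained only on series it can plausibly generalize across, which is what restores accuracy on heterogeneous databases.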
Decision-making techniques in supplier selection: Recent accomplishments and what lies ahead (2020)
Supplier selection (SS) is considered a sophisticated, application-oriented, decision-making (DM) problem and has received considerable attention. In the past two decades, DM theories and techniques have continued to be incorporated into and contribute to the development of SS applications. Keeping pace with the rapid transitions in this field, this paper systematically reviews the relevant articles published between 2013 and 2018. Articles oriented toward various DM techniques are selected and analyzed under a well-established framework. State-of-the-art developments in the adoption of DM techniques are summarized within the SS process. We pay particular attention to promising directions that can dominate future research in this field. This paper further extends the history of several interacting fields, including big data and economic theories, toward methodological rather than application dimensions. The potential of such fields for SS is discussed from an interdisciplinary perspective.
Keywords: Supplier selection | Decision making | Big data | Multiple criteria | Artificial intelligence | Literature review
Distributed mining of high utility time interval sequential patterns using MapReduce approach (2020)
High Utility Sequential Pattern (HUSP) mining algorithms aim to find all the high utility sequences in a sequence database. Due to the large explosion of data, a few distributed algorithms have recently been designed for mining HUSPs based on the MapReduce framework. However, existing HUSP algorithms such as USpan, HUS-Span, and BigHUSP capture only the order of items; they do not account for the time intervals between successive items. In real-world scenarios, however, time interval patterns provide more valuable information than conventional high utility sequential patterns. Therefore, we propose a distributed high utility time interval sequential pattern mining (DHUTISP) algorithm using the MapReduce approach that is suitable for big data. DHUTISP creates a novel time interval utility linked list data structure (TIUL) to efficiently calculate the utility of the resulting patterns. Moreover, two utility upper bounds, namely the remaining utility upper bound (RUUB) and the co-occurrence utility upper bound (CUUB), are proposed to prune unpromising candidates. We conducted various experiments to evaluate the efficiency of the proposed algorithm against both distributed and non-distributed approaches. The experimental results show the efficiency of DHUTISP over state-of-the-art algorithms, namely BigHUSP, AHUS-P, PUSOM, and UTMining_A.
Keywords: Big data | High utility itemset mining | High utility sequential pattern mining | Time interval sequential pattern mining | Mapreduce framework
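The pruning idea behind an upper bound such as RUUB can be illustrated very roughly: the utility a prefix has accrued, plus all utility that could still follow it, bounds the utility of any extension, so a prefix whose bound falls below the threshold can be discarded together with all its extensions. The functions below are a drastic simplification of ours, not the paper's definitions of RUUB/CUUB.

```python
def remaining_utility_bound(prefix_utility, item_utilities, last_pos):
    """Optimistic bound: utility accrued by the prefix plus every utility
    that could still be appended after its last matched position."""
    return prefix_utility + sum(item_utilities[last_pos + 1:])

def can_prune(prefix_utility, item_utilities, last_pos, min_utility):
    """A prefix (and all of its extensions) is unpromising when even the
    optimistic bound cannot reach the utility threshold."""
    bound = remaining_utility_bound(prefix_utility, item_utilities, last_pos)
    return bound < min_utility

# Prefix worth 5 ends at position 1; items worth 4 and 1 could still follow.
bound = remaining_utility_bound(5, [3, 2, 4, 1], last_pos=1)   # 5 + 4 + 1 = 10
```

Tighter bounds such as CUUB prune earlier at the cost of more bookkeeping; the trade-off is what the TIUL structure is designed to make cheap.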
Hybrid neural networks for big data classification (2020)
Two new hybrid neural architectures combining morphological neurons and perceptrons are introduced in this paper. The first architecture, called Morphological-Linear Neural Network (MLNN), consists of a hidden layer of morphological neurons, which extracts features, and an output layer of classical perceptrons. The second architecture, called Linear-Morphological Neural Network (LMNN), is composed of one or several perceptron layers acting as a feature extractor, followed by an output layer of morphological neurons for non-linear classification. Both architectures are trained by stochastic gradient descent. One of the main contributions of this paper is to show that the morphological layer offers a greater capacity to extract features than the perceptron layer. This claim is supported both theoretically and experimentally. We prove that the morphological layer possesses a greater capacity per computation unit to segment the 2D input space than the perceptron layer: adding more hyper-boxes produces more response regions than adding hyperplanes. From an empirical point of view, we test the two new models on 25 standard low-dimensional datasets and one big data dataset. The result is that MLNN requires fewer learning parameters than the other tested architectures while achieving better accuracies.
Keywords: Morphological neurons | Dendrite processing | Neural networks | Multilayer perceptron | Big data
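The hyper-box versus hyperplane contrast can be made concrete in a few lines. The min/max form below is a common dendrite-style formulation of a morphological neuron and is a sketch only, not necessarily the exact activation used in the paper.

```python
def morphological_neuron(x, w_low, w_high):
    """Hyper-box response: positive iff x lies inside [w_low, w_high] in
    every dimension; the value is the signed distance to the nearest face."""
    return min(min(xi - lo, hi - xi) for xi, lo, hi in zip(x, w_low, w_high))

def perceptron(x, w, b):
    """Linear response: a single hyperplane splits the space into two halves."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

inside = morphological_neuron([0.5, 0.5], [0, 0], [1, 1])    # 0.5, inside the box
outside = morphological_neuron([2.0, 0.5], [0, 0], [1, 1])   # -1.0, outside it
```

One hyper-box already carves out a bounded region (2n faces in n dimensions), whereas one perceptron contributes a single unbounded cut, which is the intuition behind the capacity-per-unit claim above.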
Associations of hospital discharge services with potentially avoidable readmissions within 30 days among older adults after rehabilitation in acute care hospitals in Tokyo, Japan (2020)
OBJECTIVE: To examine the associations of three major hospital discharge services covered under health insurance (discharge planning, rehabilitation discharge instruction, and coordination with community care) with potentially avoidable readmissions within 30 days (30-day PAR) in older adults after rehabilitation in acute care hospitals in Tokyo, Japan.
DESIGN: Retrospective cohort study using a large-scale medical claims database of all Tokyo residents aged ≥75 years.
SETTING: Acute care hospitals.
PARTICIPANTS: Patients who underwent rehabilitation and were discharged to home (n=31,247; mean age: 84.1 years, standard deviation: 5.7 years) between October 2013 and July 2014.
MAIN OUTCOME MEASURE: 30-day PAR.
RESULTS: Among the patients, 883 (2.9%) experienced 30-day PAR. A multivariable logistic generalized estimating equation model (with a logit link function and binomial sampling distribution) that adjusted for patient characteristics and clustering within hospitals showed that the discharge services were not significantly associated with 30-day PAR. The odds ratios were 0.962 (95% confidence interval [CI]: 0.805-1.151) for discharge planning, 1.060 (95% CI: 0.916-1.227) for rehabilitation discharge instruction, and 1.118 (95% CI: 0.817-1.529) for coordination with community care. In contrast, the odds of 30-day PAR among patients with home medical care services were 1.431 times higher than those of patients without these services (P<0.001), and the odds of 30-day PAR among patients with a higher number (median or higher) of rehabilitation units were 2.031 times higher than those of patients with a lower number (below median) (P<0.001). Also, the odds of 30-day PAR among patients with a higher hospital frailty risk score (median or higher) were 1.252 times higher than those of patients with a lower score (below median) (P=0.001).
CONCLUSIONS: The insurance-covered discharge services were not associated with 30-day PAR, and the development of comprehensive transitional care programs through the integration of existing discharge services may help to reduce such readmissions.
Copyright © 2020. Published by Elsevier Inc.
KEYWORDS: Big data; health services for the aged; patient readmission; rehabilitation; transitional care
Rigor and reproducibility for data analysis and design in the behavioral sciences (2020)
The rigor and reproducibility of science methods depend heavily on the appropriate use of statistical methods to answer research questions and make meaningful and accurate inferences based on data. The increasing analytic complexity and valuation of novel statistical and methodological approaches to data place greater emphasis on statistical review. We outline the controversies within the statistical sciences that threaten the rigor and reproducibility of research published in the behavioral sciences and discuss ongoing approaches to generate reliable and valid inferences from data. We outline nine major areas to consider for generally evaluating the rigor and reproducibility of published articles and apply this framework to the 116 Behaviour Research and Therapy (BRAT) articles published in 2018. The results of our analysis highlight a pattern of missing rigor and reproducibility elements, especially pre-registration of study hypotheses, links to statistical code/output, and explicit archiving or sharing of the data used in analyses. We recommend that reviewers consider these elements in their peer review and that journals consider publishing the results of these rigor and reproducibility ratings with manuscripts to incentivize authors to publish these elements with their manuscript.
KEYWORDS: statistics | big data | reproducibility | reliability | p-hacking
Eco-friendliness and fashion perceptual attributes of fashion brands: An analysis of consumers’ perceptions based on Twitter data mining (2020)
This study explores whether there is a convergence between the concepts of fashion and eco-friendliness in consumer perception of a fashion brand. We assume that increased eco-friendly perception will influence the brand image positively, with this impact being much higher for luxury than for high and fast fashion brands. The hypotheses are tested using data collected from Twitter. We analyzed the fashion clothing brands with the highest number of followers on the Socialbakers list and applied a novel social network mining methodology that allows measuring the relationship between each brand and two perceptual attributes (fashion and eco-friendliness). The method is based on attribute exemplars, that is, Twitter accounts that represent a perceptual attribute. Our exemplars catalyze social media conversations on fashion (identified in our research by the keywords “fashion,” “glamour,” and “style”) and eco-friendliness (keywords “environment” and “ethical business”). Based on social network analysis theory, we computed a similarity function between the followers of the exemplars and those of the brand. The results suggest that there is a correlation between the fashion and eco-friendliness perceptual attributes of a brand; however, this correlation is far stronger for luxury brands than for high and fast fashion brands. The difference in the correlations confirms the recent tendency of luxury fashion brands to increasingly treat environmental issues as part of their core business and not just as added value to the brand’s offer.
Keywords: Fashion brands | Twitter | Consumer perception | Environment | Ethical business | Brand image | Big data
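The core measurement step, comparing a brand's followers with an exemplar's followers, can be sketched with a simple set-overlap function. The paper's actual similarity function is grounded in social network analysis theory and may well differ; Jaccard overlap and the toy accounts below are used purely as an illustration.

```python
def follower_similarity(brand_followers, exemplar_followers):
    """Jaccard overlap between two follower sets: shared followers divided
    by the total distinct followers of both accounts."""
    union = brand_followers | exemplar_followers
    if not union:
        return 0.0
    return len(brand_followers & exemplar_followers) / len(union)

brand = {"u1", "u2", "u3"}
eco_exemplar = {"u2", "u3", "u4"}     # hypothetical eco-friendliness exemplar
score = follower_similarity(brand, eco_exemplar)   # 2 shared / 4 total = 0.5
```

Computed once against a fashion exemplar and once against an eco-friendliness exemplar, two such scores per brand are enough to ask whether the two perceptual attributes move together.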
Identification of high impact factors of air quality on a national scale using big data and machine learning techniques (2020)
To effectively control and prevent air pollution, it is necessary to study the influential factors of air quality. A number of previous studies have explored the relationships between air pollution and related factors. However, the methods currently used either cannot adequately address the multicollinearity problem or fail to explain the importance of the influential factors. Moreover, most of the existing literature limited its study area to a city or a small region and considered factors from only one aspect. There is a lack of studies that analyze the influential factors at the scale of a country or that take multiple kinds of variables into consideration. To fill this research gap, this paper proposes a multivariate analysis at the national scale to investigate the most important factors of air quality. In order to study as many influential factors as possible, 171 features spanning environmental, demographic, economic, meteorological, and energy aspects were collected and analyzed. To tackle such a “big data” problem, a non-linear machine learning algorithm, namely Extreme Gradient Boosting (XGBoost), is utilized to model the relationships and measure variable importance. A Geographical Information System (GIS) is employed to preprocess the diversified variables and visualize the results. The performance of XGBoost is compared with other models, and its parameters are tuned using Bayesian Optimization. Experimental results of a case study in the U.S. show that our methodology framework can effectively uncover the important factors of air quality. Six kinds of factors are found to have the largest impact on air quality, and practical suggestions are proposed from these six aspects to control and prevent air pollution.
Keywords: Air quality index | Big data | GIS | National scale | Variable importance | XGBoost
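Variable importance of the kind XGBoost reports can be approximated model-agnostically by permutation: shuffle one feature column and measure how much a quality metric drops. The sketch below uses this generic scheme, not XGBoost's built-in gain importance, and all data and names are hypothetical.

```python
import random

def permutation_importance(predict, X, y, col, metric, trials=10, seed=0):
    """Average drop in `metric` when column `col` is shuffled: a feature the
    model ignores scores ~0, an essential feature scores high."""
    rng = random.Random(seed)
    base = metric(y, [predict(row) for row in X])
    total_drop = 0.0
    for _ in range(trials):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        Xp = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        total_drop += base - metric(y, [predict(row) for row in Xp])
    return total_drop / trials

def accuracy(truth, pred):
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

# Hypothetical data: the "model" only ever looks at feature 0.
X = [[0, 9], [1, 8], [2, 7], [3, 6], [4, 5], [5, 4]]
y = [0, 1, 2, 3, 4, 5]
predict = lambda row: row[0]
imp_used = permutation_importance(predict, X, y, 0, accuracy)
imp_ignored = permutation_importance(predict, X, y, 1, accuracy)   # exactly 0.0
```

Ranking all 171 features by such a score is what turns the fitted black-box model into the paper's list of high-impact factors.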
The questions we ask: Opportunities and challenges for using big data analytics to strategically manage human capital resources (2020)
Big data analytics have transformed research in many fields, including the business areas of marketing, accounting and finance, and supply chain management. Yet, the discussion surrounding big data analytics in human resource management has primarily focused on job candidate screenings. In this article, we consider how significant strategic human capital questions can be addressed with big data analytics, enabling HR to enhance overall firm performance. We also examine how new data sources that help assess workforce performance in real time can assist in the identification and development of the knowledge stars that contribute to firm performance disproportionately as well as help reinforce firm capabilities. But in order for big data analytics to be successful in the HR field, regulatory and ethical challenges must also be addressed; these include privacy concerns and, in Europe, the General Data Protection Regulation (GDPR). We conclude by discussing how big data analytics can facilitate strategic change within HR and the organization as a whole.
KEYWORDS: Big data analytics | Workforce analytics | Stakeholder management | Strategic human capital | Knowledge stars | Human resource management