Refined composite multivariate multiscale symbolic dynamic entropy and its application to fault diagnosis of rotating machine
Refined composite multivariate symbolic dynamic entropy and its application to fault diagnosis of rotating machines-2020
Accurate and efficient identification of various fault categories, especially for big data and multi-sensor systems, is a challenge in rotating machinery fault diagnosis. For diagnosis problems with massive multivariate data, extracting discriminative and stable features with high efficiency is a significant step. This paper proposes a novel feature extraction method, called Refined Composite multivariate Multiscale Symbolic Dynamic Entropy (RCmvMSDE), based on refined composite analysis and multivariate multiscale symbolic dynamic entropy. Specifically, multivariate multiscale symbolic dynamic entropy can capture more identification information from multiple sensors with superior computational efficiency, while refined composite analysis guarantees its stability. The abilities of the proposed method to measure the complexity of multivariate time series and identify signals with different components are discussed based on adequate simulation analysis. Further, to verify the effectiveness of the proposed method on fault diagnosis tasks, a centrifugal pump dataset under constant speed condition and a ball bearing dataset under time-varying speed condition are applied. Compared with the existing methods, the proposed method improves the classification accuracy and F-score to 99.81% and 0.9981, respectively. Meanwhile, the proposed method saves at least half of the computational time. The result shows that the proposed method is effective in improving efficiency and classification accuracy when dealing with massive multivariate signals.
Keywords: Multivariate multiscale symbolic dynamic entropy | Random forest | Time-varying speed conditions | Fault diagnosis
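The abstract above names two ingredients, multiscale coarse-graining and a refined composite step, without giving formulas. The following is a minimal single-channel sketch of that general idea, not the paper's RCmvMSDE definition: the equal-width symbolization, the word length, the pooling of word counts over offsets, and all function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def coarse_grain(series, scale, offset):
    """Average non-overlapping windows of length `scale`, starting at `offset`."""
    n_windows = (len(series) - offset) // scale
    return [
        sum(series[offset + i * scale: offset + (i + 1) * scale]) / scale
        for i in range(n_windows)
    ]


def symbolize(series, n_symbols):
    """Map values to symbols 0..n_symbols-1 with equal-width bins (an assumption)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0] * len(series)
    width = (hi - lo) / n_symbols
    return [min(int((v - lo) / width), n_symbols - 1) for v in series]


def refined_composite_symbolic_entropy(series, scale, word_len=2, n_symbols=2):
    """Pool symbol-word counts over all offsets, then take one normalized Shannon entropy."""
    pooled = Counter()
    for offset in range(scale):
        symbols = symbolize(coarse_grain(series, scale, offset), n_symbols)
        for i in range(len(symbols) - word_len + 1):
            pooled[tuple(symbols[i:i + word_len])] += 1
    total = sum(pooled.values())
    entropy = -sum((c / total) * math.log(c / total) for c in pooled.values())
    return entropy / math.log(n_symbols ** word_len)


# Hypothetical test signals: a strictly alternating series and seeded uniform noise.
periodic = [0.0, 1.0] * 50
rng = random.Random(0)
noise = [rng.random() for _ in range(1000)]
print(refined_composite_symbolic_entropy(periodic, scale=1))
print(refined_composite_symbolic_entropy(noise, scale=1))
```

At scale 1 the alternating series uses only two of the four possible two-symbol words, so its normalized entropy is close to 0.5, while the noise series scores higher. Pooling the counts before taking the entropy, rather than averaging per-offset entropies, is the design choice that the "refined" label usually refers to in this family of methods.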
Harnessing demand-side management benefit towards achieving a 100% renewable energy microgrid
Harnessing the benefit of demand management to achieve a 100% renewable energy microgrid-2020
Optimal sizing with energy management strategy as a transition pathway towards a sustainable 100% renewable energy-based microgrid is investigated in this paper. Due to the challenges of intermittency of renewable energy, microgrid operations are complicated. Hence, in order to overcome some of the challenges facing microgrid planning and operations, optimal capacity sizing incorporated with energy management strategy considering time-ahead generation prediction is proposed. The system model consists of wind turbine (WT), solar photovoltaic (PV) and battery energy storage system (BESS). The generation forecasting output is used to reschedule the flexible demand resources (FDR) to reduce the mismatch between power demand and supply, and optimal sizing of components is performed jointly to determine the optimal capacity values of the PV, WT, and BESS for minimal investment costs. The optimization results for the scenarios with and without load shifting effects of FDRs are determined and analyzed for the case study. From the results obtained, the application of the demand scheduling program using the generation forecasting outputs resulted in a cost saving of 12.41%. The forecasting model is implemented using a random forest algorithm on the Python platform, and a mixed-integer linear program in the MATLAB® environment is used to model and solve the capacity sizing problem.
Keywords: Flexible demand resources (FDR) | Random forest (RF) | Wind turbine (WT) | Photo-voltaic (PV) | Battery energy storage (BESS)
Predicting and explaining corruption across countries: A machine learning approach
Predicting and explaining corruption across countries: A machine learning approach-2020
In the era of Big Data, Analytics, and Data Science, corruption is still ubiquitous and is perceived as one of the major challenges of modern societies. A large body of academic studies has attempted to identify and explain the potential causes and consequences of corruption, at varying levels of granularity, mostly through theoretical lenses by using correlations and regression-based statistical analyses. The present study approaches the phenomenon from the predictive analytics perspective by employing contemporary machine learning techniques to discover the most important corruption perception predictors based on enriched/enhanced nonlinear models with a high level of predictive accuracy. Specifically, within the multiclass classification modeling setting that is employed herein, the Random Forest (an ensemble-type machine learning algorithm) is found to be the most accurate prediction/classification model, followed by Support Vector Machines and Artificial Neural Networks. From the practical standpoint, the enhanced predictive power of machine learning algorithms coupled with a multi-source database revealed the most relevant corruption-related information, contributing to the related body of knowledge and generating actionable insights for administrators, scholars, citizens, and politicians. The variable importance results indicated that government integrity, property rights, judicial effectiveness, and education index are the most influential factors in defining the corruption level of significance.
Keywords: Corruption perception | Machine learning | Predictive modeling | Random forest | Society policies and regulations | Government integrity | Social development
Measuring urban poverty using multi-source data and a random forest algorithm: A case study in Guangzhou
Measuring urban poverty using multi-source data and the random forest algorithm: A case study in Guangzhou-2020
Conventional measurements of urban poverty mainly rely on census data or aggregated statistics. However, these data are produced with a relatively long cycle, and they hardly reflect the built environment characteristics that affect the livelihoods of the inhabitants. Open-access social media data can be used as an alternative data source for the study of poverty. They typically provide fine-grained information with a short updating cycle. Therefore, in this study, we developed a new approach to measure urban poverty using multi-source big data. We used social media data and remote sensing images to represent the social conditions and the characteristics of built environments, respectively. These data were used to produce the indicators of material, economic, and living conditions, which are closely related to poverty. They were integrated into a composite index, namely the Multi-source Data Poverty Index (MDPI), based on the random forest (RF) algorithm. A dataset of the General Deprivation Index (GDI) derived from the census data was used as a reference to facilitate the training of RF. A case study was carried out in Guangzhou, China, to evaluate the performance of the proposed MDPI for measuring the community-level urban poverty. The results showed a high consistency between the MDPI and GDI. By analyzing the MDPI results, we found a significantly positive spatial autocorrelation in the community-level poverty condition in Guangzhou. Compared with the GDI approach, the proposed MDPI could be updated more conveniently using big data to provide more timely information of urban poverty.
Keywords: Urban poverty | Multi-source Data Poverty Index | General Deprivation Index | Random forest
Machine-learning based error prediction approach for coarse-grid Computational Fluid Dynamics (CG-CFD)
Machine-learning-based error prediction approach for coarse-grid computational fluid dynamics (CG-CFD)-2020
Computational Fluid Dynamics (CFD) is one of the modeling approaches essential to identifying the parameters that affect Containment Thermal Hydraulics (CTH) phenomena. While the CFD approach can capture the multidimensional behavior of CTH phenomena, its computational cost is high when modeling complex accident scenarios. To mitigate this expense, we propose reliance on coarse-grid CFD (CG-CFD). Coarsening the computational grid increases the grid-induced error thus requiring a novel approach that will produce a surrogate model predicting the distribution of the CG-CFD local error and correcting the fluid-flow variables. Given sufficiently fine-mesh simulations, a surrogate model can be trained to predict the CG-CFD local errors as a function of the coarse-grid local flow features. The surrogate model is constructed using Machine Learning (ML) regression algorithms. Two of the widely used ML regression algorithms were tested: Artificial Neural Network (ANN) and Random Forest (RF). The proposed CG-CFD method is illustrated with a three-dimensional turbulent flow inside a lid-driven cavity. We studied a set of scenarios to investigate the capability of the surrogate model to interpolate and extrapolate outside the training data range. The proposed method has proven capable of correcting the coarse-grid results and obtaining reasonable predictions for new cases (of different Reynolds number, different grid sizes, or larger geometries). Based on the investigated cases, we found this novel method maximizes the benefit of the available data and shows potential for a good predictive capability.
Keywords: Coarse grid (mesh) | CFD | Machine learning | Discretization error | Big data | Artificial neural network | Random forest | Data-driven
GIS-based groundwater potential mapping in Shahroud plain, Iran: A comparison among statistical (bivariate and multivariate), data mining and MCDM approaches
GIS-based groundwater potential mapping in the Shahroud plain, Iran: A comparison of statistical (bivariate and multivariate), data mining, and MCDM methods-2019
In arid and semi-arid areas, groundwater is one of the most important water sources for humankind. Knowledge of groundwater distribution over space, associated flow and basic exploitation measures can play a significant role in planning sustainable development, especially in arid and semi-arid areas. Groundwater potential mapping (GWPM) fits in this context as the tool used to predict the spatial distribution of groundwater. In this research we tested four GIS-based models for GWPM, consisting of: i) random forest (RF); ii) weight of evidence (WoE); iii) binary logistic regression (BLR); and iv) technique for order preference by similarity to ideal solution (TOPSIS) multi-criteria. The Shahroud plain, located in Iran, was selected to research the water scarcity and overexploitation of groundwater resources over the past 20 years. In this research, using Iranian Department of Water Resources Management data and extensive field surveys, 122 groundwater well data with high potential yield of ≥11 m3 h−1 were selected for GWPM. Specifically, we generated four different models selecting 70% (n=85) of the wells and validated the resulting GWP maps upon the complementary 30% (n=37). A total of fifteen groundwater conditioning factors to explain the groundwater well distribution over the Shahroud plain were selected. From the Advanced Land Observing Satellite (ALOS), a DEM (30 m resolution) was extracted to calculate a set of morphometric properties which were combined with thematic ones such as land use/land cover (LU/LC) and Soil Type (ST). Results show that in RF (LU/LC), LR (ST), and AHP (Slope) are the most relevant contributors to groundwater occurrence. After that, using the natural break method, final maps were divided into five susceptibility classes of very low, low, moderate, high, and very high. The accuracy of models was ultimately tested using prediction rate (validation data), success rate (training data) and the seed cell area index (SCAI) indicators. Results of validation show that BLR with prediction rate of 0.905 (90.5%) and success rate of 0.918 (91.8%) had higher accuracy than WoE, RF and TOPSIS models with respective prediction rates of 0.885, 0.873 and 0.870 (88.5%, 87.3%, and 87%) and success rates of 0.900, 0.889, and 0.881 (90%, 88.9%, and 88.1%). SCAI results show that all models have acceptable classification accuracy although BLR outperformed the other models in terms of accuracy. Results show that the combination of remote sensing (RS) data and geographic information system (GIS) with new approaches can be used as a powerful tool in GWPM in arid and semi-arid areas. The results of this investigation introduced a potential novel methodology that could be used by decision-makers for the sustainable management of groundwater resources.
Keywords: Random forest | Weight of evidence | Binary logistic regression | Decision making | Semi-arid region
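The 70%/30% partition of the 122 wells reported above can be reproduced with a short sketch. The shuffling seed, the truncating rounding rule, and the function name are assumptions; the abstract states only the resulting counts (85 training wells, 37 validation wells).

```python
import random


def split_wells(well_ids, train_fraction=0.7, seed=0):
    """Shuffle well identifiers and split them into training and validation sets."""
    ids = list(well_ids)
    random.Random(seed).shuffle(ids)
    # Truncation is an assumed rounding rule: 122 * 0.7 = 85.4 gives 85 training wells.
    n_train = int(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


train_wells, validation_wells = split_wells(range(122))
print(len(train_wells), len(validation_wells))  # 85 37
```

The success rate in the abstract is then computed on the training wells and the prediction rate on the held-out validation wells.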
Groundwater spring potential mapping using population-based evolutionary algorithms and data mining methods
Groundwater spring potential mapping using population-based evolutionary algorithms and data mining methods-2019
Water scarcity in many regions of the world has become an unpleasant reality. Groundwater appears to be one of the main natural resources capable of reversing this situation. Uncovering the spatial patterns of groundwater occurrence is a crucial factor that could assist in carrying out successful water resources management projects. The main objective of the current study was to provide a novel methodological approach which utilized a Genetic Algorithm (GA) to perform a feature selection procedure and data mining methods for generating a groundwater spring potential map. Three data mining methods, Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF), were utilized to construct a groundwater spring potential map that had over 0.81 probability of occurrence for the Wuqi County, Shaanxi Province, China. Groundwater spring locations and sixteen related variables were analyzed, namely: lithology, soil cover, land use cover, normalized difference vegetation index (NDVI), elevation, slope angle, aspect, planform curvature, profile curvature, curvature, stream power index (SPI), stream transport index (STI), topographic wetness index (TWI), mean annual rainfall, distance from river network and distance from road network. The Frequency ratio method was used to weight the variables, whereas a multi-collinearity analysis was performed to identify the relation between the parameters and to decide about their usage. The optimal set of parameters, which was determined by the GA, reduced the number of parameters to twelve, removing planform curvature, profile curvature, curvature and STI. The Receiver Operating Characteristic curve and the area under the curve (AUROC) were estimated so as to evaluate the predictive power of each model. The results indicated that the optimized models were superior in accuracy to the original models. The optimized RF model produced the best results (0.9572), followed by the optimized SVM (0.9529) and the optimized NB (0.8235). Overall, the current study highlights the necessity of applying feature selection techniques in groundwater spring assessments and also that data mining methods may be a highly powerful investigation approach for groundwater spring potential mapping.
Keywords: Groundwater spring potential mapping | Genetic algorithm | Naïve Bayes | Support Vector Machine | Random Forest | China
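The AUROC values quoted above summarize how well each model ranks spring locations above non-spring locations. A minimal sketch of the pairwise (rank-based) definition of AUROC follows; the labels and scores are made-up illustrative values, not data from the study, and the function name is hypothetical.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only: 1 marks a spring location, 0 a non-spring location.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.5, 0.1]
print(auroc(example_labels, example_scores))  # 0.75
```

A value of 1.0 means every spring location outranks every non-spring location, and 0.5 corresponds to random ranking. The quadratic pair loop is chosen for readability; sorting-based formulas give the same value on large datasets.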
Advancing the prediction accuracy of satellite-based PM2.5 concentration mapping: A perspective of data mining through in situ PM2.5 measurements
Advancing the prediction accuracy of satellite-based PM2.5 concentration mapping: A data mining perspective through in situ PM2.5 measurements-2019
Ground-measured PM2.5 concentration data are oftentimes used as a response variable in various satellite-based PM2.5 mapping practices, yet few studies have attempted to incorporate ground-measured PM2.5 data collected from nearby stations or previous days as a priori information to improve the accuracy of gridded PM2.5 mapping. In this study, Gaussian kernel-based interpolators were developed to estimate prior PM2.5 information at each grid using neighboring PM2.5 observations in space and time. The estimated prior PM2.5 information and other factors such as aerosol optical depth (AOD) and meteorological conditions were incorporated into random forest regression models as essential predictor variables for more accurate PM2.5 mapping. The results of our case study in eastern China indicate that the inclusion of ground-based PM2.5 neighborhood information can significantly improve PM2.5 concentration mapping accuracy, yielding an increase of out-of-sample cross validation R2 by 0.23 (from 0.63 to 0.86) and a reduction of RMSE by 7.72 (from 19.63 to 11.91) μg/m3. In terms of the estimated relative importance of predictors, the PM2.5 neighborhood information played a more critical role than AOD in PM2.5 predictions. Compared with the temporal PM2.5 neighborhood term, the spatially neighboring PM2.5 term has an even larger potential to improve the final PM2.5 prediction accuracy. Additionally, a more robust and straightforward PM2.5 predictive framework was established by screening and removing the least important predictor stepwise from each modeling trial toward the final optimization. Overall, our results fully confirmed the positive effects of ground-based PM2.5 information over spatiotemporally neighboring space on the holistic PM2.5 mapping accuracy.
Keywords: PM2.5 | Aerosol optical depth | Spatiotemporal interpolation | Random forest | Air quality
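The Gaussian kernel-based interpolation mentioned above can be illustrated with a kernel-weighted average of nearby station readings. This is a spatial-only sketch under assumed conventions: the coordinate units, the bandwidth value, the station data, and the function name are hypothetical, and the study's interpolators also use neighbors in time, which is omitted here.

```python
import math


def gaussian_kernel_prior(target_xy, stations, bandwidth):
    """Estimate a prior value at `target_xy` as a Gaussian-weighted mean of station readings.

    `stations` is a list of ((x, y), value) pairs; `bandwidth` controls how quickly
    a station's influence decays with distance (same units as the coordinates).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in stations:
        squared_distance = (x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2
        weight = math.exp(-squared_distance / (2.0 * bandwidth ** 2))
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("all stations are too far away for this bandwidth")
    return weighted_sum / weight_total


# Hypothetical stations: two monitors 2 units apart with different readings.
example_stations = [((0.0, 0.0), 10.0), ((2.0, 0.0), 30.0)]
print(gaussian_kernel_prior((1.0, 0.0), example_stations, bandwidth=1.0))
```

A grid cell midway between the two stations receives equal weights and therefore the mean of the two readings, while a cell next to one station is dominated by that station. In the workflow the abstract describes, an estimate of this kind would enter the random forest as one more predictor alongside AOD and meteorology.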
Using machine learning to estimate a key missing geochemical variable in mining exploration: Application of the Random Forest algorithm to multisensor core logging data
Using machine learning to estimate a key missing geochemical variable in mining exploration: Application of the random forest algorithm to multi-sensor core logging data-2019
Mining exploration increasingly relies on large, multivariate databases storing data ranging from drill core geochemical analysis to geophysical data or geological descriptions. Utilizing these large datasets to their full potential implies the use of multivariate statistical analysis such as machine learning. The Random Forest algorithm has proved its efficiency in mining applications. In this study we use it to estimate a key geochemical element, sodium, using a multivariate chemo-physical dataset measured on drill cores in the Matagami mining district of Québec, Canada. Sodium is important to characterize hydrothermal alteration in volcanogenic massive sulfide settings, since Na depletion can be used to vector towards ore, but this element is not readily measured by portable X-ray fluorescence (pXRF). We first test the algorithm on a database of over 8000 traditional laboratory geochemistry analyses and find a correlation of 0.95 between estimated and measured Na. We then test the algorithm on the multi-sensor core logging data, including density, magnetic susceptibility, and 15 geochemical elements by pXRF, but borrowing Na from traditional geochemistry (n=260). This yields correlations of 0.66 to 0.75 depending on the training and testing sets. Finally, the algorithm is applied to the whole multiparameter database (n=9675) to estimate Na downcore. There is a good general correspondence between the downcore Na patterns seen through traditional geochemistry and the estimated Na, which has much greater spatial resolution. Random Forest appears to be a very good estimation tool when using large amounts of data and variables, as it uses all variables and automatically prioritizes the most useful. This method also allows visualization of the weight of each variable in the estimation. Future studies should compare RF with other methods.
Keywords: Artificial intelligence | Geochemistry | Supervised method | Mineral exploration
Predicting complexation performance between cyclodextrins and guest molecules by integrated machine learning and molecular modeling techniques
Predicting complexation performance between cyclodextrins and guest molecules with integrated machine learning and molecular modeling techniques-2019
Most pharmaceutical formulation developments are complex, and ideal formulations are generally obtained after extensive experimentation. Machine learning is increasingly advancing many aspects of modern society and has achieved significant success in multiple subjects. Current research demonstrated that machine learning can be adopted to build highly accurate predictive models in drugs/cyclodextrins (CDs) systems. Molecular descriptors of compounds and experimental conditions were employed as inputs, while complexation free energy was the output. Results showed that the light gradient boosting machine provided significantly improved predictive performance over random forest and deep learning. The mean absolute error was 1.38 kJ/mol and the squared correlation coefficient was 0.86. The evaluation of relative importance of molecular descriptors further demonstrated the key factors affecting molecular interactions in drugs/CD systems. In the specific ketoprofen-CD systems, the machine learning model showed better predictive performance than molecular modeling calculation, while molecular simulation could provide structural, dynamic and energetic information. The integration of machine learning and molecular simulation could produce a synergistic effect for interpreting and predicting pharmaceutical formulations. In conclusion, the developed predictive models were able to quickly and accurately predict the solubilizing capacity of CD systems. Current research has taken an important step toward the application of machine learning in pharmaceutical formulation design.
Keywords: Machine learning | Deep learning | LightGBM | Random forest | Cyclodextrin | Binding free energy | Molecular modeling | Ketoprofen
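The two figures of merit reported above, mean absolute error and squared correlation coefficient, can be computed as sketched below. The sketch assumes the squared correlation coefficient is the square of the Pearson correlation between measured and predicted values, which the abstract does not specify; the function names and the sample values are illustrative only.

```python
def mean_absolute_error(measured, predicted):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


def squared_correlation(measured, predicted):
    """Square of the Pearson correlation coefficient (assumed meaning of the reported R2)."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    covariance = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    variance_m = sum((m - mean_m) ** 2 for m in measured)
    variance_p = sum((p - mean_p) ** 2 for p in predicted)
    return covariance * covariance / (variance_m * variance_p)


# Illustrative values only, not free energies from the study.
example_measured = [1.0, 2.0, 3.0]
example_predicted = [2.0, 4.0, 6.0]
print(mean_absolute_error(example_measured, example_predicted))  # 2.0
print(squared_correlation(example_measured, example_predicted))  # 1.0
```

The example shows why both metrics are reported together: predictions that are perfectly correlated with the measurements can still carry a large absolute error.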