Rapid discrimination of Salvia miltiorrhiza according to their geographical regions by laser induced breakdown spectroscopy (LIBS) and particle swarm optimization-kernel extreme learning machine (PSO-KELM)
(2020)
Laser-induced breakdown spectroscopy (LIBS) coupled with a particle swarm optimization-kernel extreme learning machine (PSO-KELM) method was developed for the classification and identification of six types of Salvia miltiorrhiza samples from different regions. The spectral data of 15 Salvia miltiorrhiza samples were collected by a LIBS spectrometer. An unsupervised classification model based on principal component analysis (PCA) was first employed to classify the Salvia miltiorrhiza samples by region. The results showed that only the samples from Gansu and Sichuan Provinces could be easily distinguished, while samples from the other regions posed a greater challenge for PCA-based classification. A supervised classification model based on KELM was then developed, and two variable selection methods, random forest (RF) and PSO, were used to eliminate uninformative variables and improve the classification ability of the KELM model. The results showed that the PSO-KELM model achieved the better classification result, with an accuracy of 94.87%. Compared with the results obtained by particle swarm optimization-least squares support vector machine (PSO-LSSVM) and PSO-RF models, the PSO-KELM model possessed the best classification performance. The overall results demonstrate that the LIBS technique combined with the PSO-KELM method is a promising approach for the classification and identification of Salvia miltiorrhiza samples from different regions.
Keywords: Laser-induced breakdown spectroscopy | Particle swarm optimization | Kernel extreme learning machine | Salvia miltiorrhiza | Classification
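To make the KELM stage concrete, below is a minimal closed-form sketch of a kernel extreme learning machine classifier: the output weights are solved as a regularized kernel least-squares problem. The RBF kernel, the synthetic two-cluster "spectra", and the fixed C and gamma values are illustrative assumptions, not the paper's actual data or settings.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF kernel: exp(-gamma * ||a - b||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

class KELM:
    """Kernel extreme learning machine: output weights solved in closed form
    as a regularized kernel least-squares problem."""
    def __init__(self, C=1.0, gamma=1.0):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X = X
        self.classes_, idx = np.unique(y, return_inverse=True)
        T = np.eye(len(self.classes_))[idx]          # one-hot targets
        K = rbf_kernel(X, X, self.gamma)
        # beta = (I/C + K)^(-1) T
        self.beta = np.linalg.solve(np.eye(len(X)) / self.C + K, T)
        return self

    def predict(self, Xnew):
        scores = rbf_kernel(Xnew, self.X, self.gamma) @ self.beta
        return self.classes_[scores.argmax(axis=1)]

# Two well-separated synthetic clusters standing in for spectra of two regions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 5)), rng.normal(2.0, 0.3, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
model = KELM(C=10.0, gamma=0.5).fit(X, y)
acc = (model.predict(X) == y).mean()
```

In the paper's pipeline, a PSO wrapper would treat (C, gamma) and the selected spectral variables as particle positions and use classification accuracy as the fitness; that search is omitted here.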
Refined composite multivariate multiscale symbolic dynamic entropy and its application to fault diagnosis of rotating machine
(2020)
Accurate and efficient identification of various fault categories, especially for big data and multisensory systems, is a challenge in rotating machinery fault diagnosis. For diagnosis problems with massive multivariate data, extracting discriminative and stable features with high efficiency is a critical step. This paper proposes a novel feature extraction method, called Refined Composite multivariate Multiscale Symbolic Dynamic Entropy (RCmvMSDE), based on refined composite analysis and multivariate multiscale symbolic dynamic entropy. Specifically, multivariate multiscale symbolic dynamic entropy can capture more identification information from multiple sensors with superior computational efficiency, while refined composite analysis guarantees its stability. The abilities of the proposed method to measure the complexity of multivariate time series and to identify signals with different components are discussed through extensive simulation analysis. Further, to verify the effectiveness of the proposed method on fault diagnosis tasks, a centrifugal pump dataset under a constant-speed condition and a ball bearing dataset under a time-varying-speed condition are used. Compared with existing methods, the proposed method improves the classification accuracy and F-score to 99.81% and 0.9981, respectively, while saving at least half of the computational time. The results show that the proposed method effectively improves both efficiency and classification accuracy when dealing with massive multivariate signals.
Keywords: Multivariate multiscale symbolic dynamic entropy | Random forest | Time-varying speed conditions | Fault diagnosis
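The coarse-graining, symbolization, and entropy steps underlying methods of this family can be illustrated with a deliberately simplified univariate sketch. The equal-width symbolization, word length, and bin count below are illustrative assumptions, and the paper's actual method is multivariate with additional machinery; the "refined composite" idea shown here is averaging the entropy over all offsets of the coarse-graining rather than using a single one.

```python
import numpy as np
from collections import Counter

def symbolize(x, n_symbols=4):
    # Map samples to symbols via an equal-width partition of the value range
    edges = np.linspace(x.min(), x.max(), n_symbols + 1)[1:-1]
    return np.digitize(x, edges)

def word_entropy(sym, m=2):
    # Shannon entropy of embedded symbol words of length m
    words = [tuple(sym[i:i + m]) for i in range(len(sym) - m + 1)]
    counts = np.array(list(Counter(words).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def rc_msse(x, scale, m=2, n_symbols=4):
    """Refined composite multiscale symbolic entropy (simplified, univariate):
    average the word entropy over all `scale` offsets of the coarse-graining."""
    ents = []
    for k in range(scale):
        n = (len(x) - k) // scale
        cg = x[k:k + n * scale].reshape(n, scale).mean(axis=1)  # coarse-grain
        ents.append(word_entropy(symbolize(cg, n_symbols), m))
    return float(np.mean(ents))

# A structured signal should score lower than white noise
t = np.linspace(0.0, 20 * np.pi, 2000)
e_sine = rc_msse(np.sin(t), scale=2)
e_noise = rc_msse(np.random.default_rng(0).normal(size=2000), scale=2)
```

The comparison at the end reflects the complexity-measurement property the abstract describes: a deterministic sine produces far fewer distinct symbol words than noise.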
Analysis of substance use and its outcomes by machine learning I: Childhood evaluation of liability to substance use disorder
(2020)
Background: Substance use disorder (SUD) exacts enormous societal costs in the United States, and it is important to detect high-risk youths for prevention. Machine learning (ML) finds patterns in data and makes predictions from them. We hypothesized that ML can identify the health, psychological, psychiatric, and contextual features that predict SUD, and that the identified features can flag high-risk individuals likely to develop SUD. Method: Male (N=494) and female (N=206) participants and their informant parents were administered a battery of questionnaires across five waves of assessment conducted at 10–12, 12–14, 16, 19, and 22 years of age. Characteristics most strongly associated with SUD were identified using the random forest (RF) algorithm from approximately 1000 variables measured at each assessment. Next, the complement of features was validated, and the best models were selected for predicting SUD using seven ML algorithms. Lastly, the area under the receiver operating characteristic curve (AUROC) evaluated the accuracy of detecting individuals who develop SUD+/- up to thirty years of age. Results: Approximately thirty variables strongly predict SUD. The predictors shift from psychological dysregulation and poor health behavior in late childhood to non-normative socialization in mid to late adolescence. In 10–12-year-old youths, the features predict SUD+/- with 74% accuracy, increasing to 86% at 22 years of age. The RF algorithm optimally detects individuals between 10 and 22 years of age who develop SUD, compared to other ML algorithms. Conclusion: These findings inform the items required for inclusion in instruments to accurately identify high-risk youths and young adults requiring SUD prevention.
Keywords: Substance use disorder | Machine learning | Substance abuse prevention | Big data | Screening addiction risk
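The two-stage workflow described above (RF-based feature ranking, then AUROC evaluation) can be sketched as follows. The data are synthetic: 20 questionnaire-style variables of which only the first three drive the outcome, a stand-in for the real assessment items.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400
# 20 synthetic questionnaire-style variables; only the first three matter
X = rng.normal(size=(n, 20))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.8 * X[:, 2]
y = (logits + rng.normal(0.0, 0.5, n) > 0).astype(int)   # SUD+/- stand-in

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)

# Stage 1: rank candidate variables by importance (most important first)
ranking = np.argsort(rf.feature_importances_)[::-1]
# Stage 2: evaluate detection accuracy with AUROC on held-out individuals
auroc = roc_auc_score(yte, rf.predict_proba(Xte)[:, 1])
```

With real data, stage 1 would run per assessment wave over the ~1000 measured variables, and stage 2 would compare the seven candidate ML algorithms on the retained feature set.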
Predicting and explaining corruption across countries: A machine learning approach
(2020)
In the era of Big Data, Analytics, and Data Science, corruption remains ubiquitous and is perceived as one of the major challenges of modern societies. A large body of academic studies has attempted to identify and explain the potential causes and consequences of corruption, at varying levels of granularity, mostly through theoretical lenses using correlations and regression-based statistical analyses. The present study approaches the phenomenon from the predictive analytics perspective, employing contemporary machine learning techniques to discover the most important corruption perception predictors based on enriched/enhanced nonlinear models with a high level of predictive accuracy. Specifically, within the multiclass classification modeling setting employed herein, the Random Forest (an ensemble-type machine learning algorithm) is found to be the most accurate prediction/classification model, followed by Support Vector Machines and Artificial Neural Networks. From the practical standpoint, the enhanced predictive power of machine learning algorithms coupled with a multi-source database revealed the most relevant corruption-related information, contributing to the related body of knowledge and generating actionable insights for administrators, scholars, citizens, and politicians. The variable importance results indicated that government integrity, property rights, judicial effectiveness, and the education index are the most influential factors in determining a country's level of corruption.
Keywords: Corruption perception | Machine learning | Predictive modeling | Random forest | Society policies and regulations | Government integrity | Social development
Automatic bad channel detection in implantable brain-computer interfaces using multimodal features based on local field potentials and spike signals
(2020)
“Bad channels” in implantable multi-channel recordings complicate the precise quantitative description and analysis of neural signals, especially in the current “big data” era. In this paper, we combine multimodal features based on local field potentials (LFPs) and spike signals to detect bad channels automatically using machine learning. On the basis of 2632 pairs of LFP and spike recordings acquired from five pigeons, 12 multimodal features are used to quantify each channel’s temporal, frequency, phase, and firing-rate properties. We implement seven classifiers for the detection tasks, in which the synthetic minority oversampling technique (SMOTE) and Fisher weighted Euclidean distance sorting (FWEDS) are used to cope with the class imbalance problem. The results of the two-dimensional scatterplots and classifications demonstrate that the correlation coefficient, phase locking value, and coherence have good discriminability. For the multimodal features, almost all the classifiers can obtain high accuracy and bad channel detection rates after the SMOTE operation, among which the Random Forests classifier shows relatively better overall performance (accuracy: 0.9092 ± 0.0081, precision: 0.9123 ± 0.0100, and recall: 0.9057 ± 0.0121). The proposed approach can automatically detect bad channels based on multimodal features, and the results provide valuable references for larger datasets.
Keywords: Bad channel | Multimodal feature | LFP | Spike | Machine learning
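The class-imbalance step is central here: bad channels are rare, so the minority class is oversampled before training. Below is a minimal SMOTE sketch followed by a Random Forests fit on the rebalanced set. The 100-vs-10 channel counts, the 12-dimensional feature vectors, and the Gaussian clusters are synthetic stand-ins for illustration; a production pipeline would use an established implementation such as imbalanced-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: each synthetic sample is interpolated between a
    random minority point and one of its k nearest minority neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d2 = ((X_min - X_min[i]) ** 2).sum(axis=1)
        nbrs = np.argsort(d2)[1:k + 1]            # skip the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                        # interpolation coefficient
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Synthetic stand-in: 100 "good" vs 10 "bad" channels, 12 features each
rng = np.random.default_rng(0)
X_good = rng.normal(0.0, 1.0, (100, 12))
X_bad = rng.normal(3.0, 1.0, (10, 12))
X_syn = smote(X_bad, n_new=90, k=3, rng=rng)      # rebalance the minority class

X_bal = np.vstack([X_good, X_bad, X_syn])
y_bal = np.array([0] * 100 + [1] * 100)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
```

Because SMOTE interpolates between minority points, the synthetic samples stay inside the minority region rather than duplicating existing channels.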
Leveraging Google Earth Engine (GEE) and machine learning algorithms to incorporate in situ measurement from different times for rangelands monitoring
(2020)
Mapping and monitoring of indicators of soil cover, vegetation structure, and various native and non-native species is a critical aspect of rangeland management. With advancements in satellite imagery as well as cloud storage and computing, the capability now exists to conduct planetary-scale analysis, including mapping of rangeland indicators. Combined with recent investments in the collection of large amounts of in situ data in the western U.S., new approaches using machine learning enable prediction of surface conditions at times and places where no in situ data are available. However, little analysis has yet been done on how the temporal relevancy of training data influences model performance. Here, we have leveraged the Google Earth Engine (GEE) platform and a machine learning algorithm (Random Forest, after comparison with other candidates) to identify the potential impact of different sampling times (across months and years) on the estimation of rangeland indicators from the Bureau of Land Management's (BLM) Assessment, Inventory, and Monitoring (AIM) and Landscape Monitoring Framework (LMF) programs. Our results indicate that temporally relevant training data improve predictions, though the training data need not be from the exact same month and year for a prediction to be temporally relevant. Moreover, inclusion of training data from the time when predictions are desired leads to lower prediction error, while the addition of training data from other times does not increase overall model error. Using all of the available training data can lead to biases toward the mean for times when indicator values are especially high or low. However, for mapping purposes, limiting training data to just the time when predictions are desired can lead to poor predictions of values outside the spatial range of the training data for that period.
We conclude that the best Random Forest prediction maps will use training data from all possible times with the understanding that estimates at the extremes will be biased.
Keywords: Google earth engine | Big data | Machine learning | Domain adaptation | Transfer learning | Feature selection | Rangeland monitoring
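The temporal-relevancy effect the authors describe can be demonstrated with a toy experiment: train one Random Forest on data from the wrong period and one on data from the target period, then compare errors. The three covariates, the seasonal offset, and the "June"/"September" labels are assumptions invented for illustration, not the AIM/LMF data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

def sample_period(n, offset):
    # Three stand-in remote-sensing covariates; the indicator level
    # shifts with the season (the assumed `offset`)
    X = rng.uniform(0.0, 1.0, (n, 3))
    y = 10.0 * X[:, 0] + offset + rng.normal(0.0, 0.2, n)
    return X, y

X_jun, y_jun = sample_period(300, offset=0.0)   # training data from "June"
X_sep, y_sep = sample_period(300, offset=3.0)   # training data from "September"
X_te, y_te = sample_period(100, offset=3.0)     # predictions wanted for September

rf_stale = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_jun, y_jun)
rf_fresh = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_sep, y_sep)
mae_stale = np.abs(rf_stale.predict(X_te) - y_te).mean()
mae_fresh = np.abs(rf_fresh.predict(X_te) - y_te).mean()
```

The stale model inherits the June baseline and is biased by roughly the seasonal offset, mirroring the bias-toward-the-mean behavior discussed in the abstract.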
Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data
(2020)
The water quality prediction performance of machine learning models may depend not only on the models themselves but also on the parameters in the data set chosen for training them. Moreover, the key water parameters should be identified by the learning models in order to further reduce prediction costs and improve prediction efficiency. Here we endeavored for the first time to compare the water quality prediction performance of 10 learning models (7 traditional and 3 ensemble models) using big data (33,612 observations) from the major rivers and lakes in China from 2012 to 2018, based on precision, recall, F1-score, and weighted F1-score, and to explore potential key water parameters for future model prediction. Our results showed that bigger data improved the performance of the learning models in predicting water quality. Compared to the other 7 models, decision tree (DT), random forest (RF), and deep cascade forest (DCF) trained on data sets of pH, DO, CODMn, and NH3-N performed significantly better in predicting all six levels of water quality recommended by the Chinese government. Moreover, two key water parameter sets (DO, CODMn, and NH3-N; CODMn and NH3-N) were identified and validated by DT, RF, and DCF to have high specificity for water quality prediction. Therefore, DT, RF, and DCF with selected key water parameters could be prioritized for future water quality monitoring and for providing timely water quality warnings.
Keywords: Water quality prediction | Machine learning models | Ensemble methods | Deep cascade forest | The key water parameters
Optimizing hyperparameters of deep learning in predicting bus passengers based on simulated annealing
(2020)
The bus is certainly one of the most widely used public transportation systems in a modern city because it provides an inexpensive option to public transportation users, such as commuters and tourists. Most people would prefer to avoid taking a crowded bus, which is why forecasting the number of bus passengers has been a critical problem for years. The proposed method is motivated by the fact that there is no easy way to determine suitable parameters for most deep learning methods when solving the optimization problem of forecasting the number of passengers on a bus. To address this issue, the proposed algorithm uses simulated annealing (SA) to find a suitable number of neurons for each layer of a fully connected deep neural network (DNN), enhancing the accuracy rate for this particular optimization problem. The proposed method is compared with support vector machine, random forest, eXtreme gradient boosting, deep neural network, and deep neural network with dropout on data provided by the Taichung city smart transportation big data research center, Taiwan (TSTBDRC). Our simulation results indicate that the proposed method outperforms all the other forecasting methods for forecasting the number of bus passengers in terms of both accuracy rate and prediction time.
Keywords: Bus transportation system | Simulated annealing | Deep learning | Hyperparameter optimization
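The SA search over neurons-per-layer can be sketched as follows. To keep the example runnable, the expensive objective (training a DNN and measuring validation error) is replaced by a stand-in quadratic bowl with an assumed optimum at (64, 32) neurons; the neighbour move sizes, cooling schedule, and starting point are also assumptions.

```python
import math
import random

def validation_error(layers):
    # Stand-in for the real objective (train a DNN with these layer widths
    # and return its validation error); here a toy bowl with optimum (64, 32)
    return (layers[0] - 64) ** 2 + (layers[1] - 32) ** 2

def neighbour(layers, rng):
    # Perturb the neuron count of one randomly chosen layer
    i = rng.randrange(len(layers))
    new = list(layers)
    new[i] = max(1, new[i] + rng.choice([-8, -4, 4, 8]))
    return tuple(new)

def simulated_annealing(start, T=100.0, cooling=0.95, steps=500, seed=0):
    rng = random.Random(seed)
    cur, cur_e = start, validation_error(start)
    best, best_e = cur, cur_e
    for _ in range(steps):
        cand = neighbour(cur, rng)
        cand_e = validation_error(cand)
        # Accept better moves always, worse moves with Boltzmann probability
        if cand_e < cur_e or rng.random() < math.exp((cur_e - cand_e) / T):
            cur, cur_e = cand, cand_e
            if cur_e < best_e:
                best, best_e = cur, cur_e
        T *= cooling
    return best, best_e

best_layers, best_e = simulated_annealing((16, 16))
```

Swapping the stand-in objective for an actual train-and-validate call yields the paper's scheme; the annealing loop itself is unchanged.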
Measuring urban poverty using multi-source data and a random forest algorithm: A case study in Guangzhou
(2020)
Conventional measurements of urban poverty mainly rely on census data or aggregated statistics. However, these data are produced over a relatively long cycle, and they hardly reflect the built environment characteristics that affect the livelihoods of inhabitants. Open-access social media data can be used as an alternative data source for the study of poverty, as they typically provide fine-grained information with a short updating cycle. Therefore, in this study, we developed a new approach to measure urban poverty using multi-source big data. We used social media data and remote sensing images to represent the social conditions and the characteristics of built environments, respectively. These data were used to produce indicators of material, economic, and living conditions, which are closely related to poverty. They were integrated into a composite index, namely the Multi-source Data Poverty Index (MDPI), based on the random forest (RF) algorithm. A dataset of the General Deprivation Index (GDI) derived from the census data was used as a reference to facilitate the training of the RF. A case study was carried out in Guangzhou, China, to evaluate the performance of the proposed MDPI for measuring community-level urban poverty. The results showed a high consistency between the MDPI and GDI. By analyzing the MDPI results, we found a significantly positive spatial autocorrelation in the community-level poverty condition in Guangzhou. Compared with the GDI approach, the proposed MDPI could be updated more conveniently using big data to provide more timely information on urban poverty.
Keywords: Urban poverty | Multi-source Data Poverty Index | General Deprivation Index | Random forest
Machine-learning based error prediction approach for coarse-grid Computational Fluid Dynamics (CG-CFD)
(2020)
Computational Fluid Dynamics (CFD) is one of the modeling approaches essential to identifying the parameters that affect Containment Thermal Hydraulics (CTH) phenomena. While the CFD approach can capture the multidimensional behavior of CTH phenomena, its computational cost is high when modeling complex accident scenarios. To mitigate this expense, we propose reliance on coarse-grid CFD (CG-CFD). Coarsening the computational grid increases the grid-induced error, thus requiring a novel approach that produces a surrogate model predicting the distribution of the CG-CFD local error and correcting the fluid-flow variables. Given sufficiently fine-mesh simulations, a surrogate model can be trained to predict the CG-CFD local errors as a function of the coarse-grid local flow features. The surrogate model is constructed using Machine Learning (ML) regression algorithms. Two widely used ML regression algorithms were tested: Artificial Neural Network (ANN) and Random Forest (RF). The proposed CG-CFD method is illustrated with three-dimensional turbulent flow inside a lid-driven cavity. We studied a set of scenarios to investigate the capability of the surrogate model to interpolate and extrapolate outside the training data range. The proposed method has proven capable of correcting the coarse-grid results and obtaining reasonable predictions for new cases (with different Reynolds numbers, different grid sizes, or larger geometries). Based on the investigated cases, we found that this novel method maximizes the benefit of the available data and shows potential for good predictive capability.
Keywords: Coarse grid (mesh) | CFD | Machine learning | Discretization error | Big data | Artificial neural network | Random forest | Data-driven
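The surrogate idea — regress the local discretization error on coarse-grid flow features, then add the predicted error back as a correction — can be sketched with an RF regressor. The three features and the error model below are assumed stand-ins, not actual CFD output from the lid-driven cavity study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# Toy coarse-grid local flow features (e.g. a velocity component, a gradient
# magnitude, a local cell size) -- assumed stand-ins, not real CFD data
features = rng.uniform(0.0, 1.0, (n, 3))
# Assumed error model: grid-induced error grows with gradient and cell size
local_error = 0.5 * features[:, 1] * features[:, 2] + rng.normal(0.0, 0.01, n)

# Train on "fine-mesh" reference cases, evaluate on held-out cells
surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(features[:400], local_error[:400])

pred = surrogate.predict(features[400:])
mae = np.abs(pred - local_error[400:]).mean()
# In the full method, `pred` would be added back to the coarse-grid field
# to correct the fluid-flow variables.
```

Interpolation vs. extrapolation behavior would be probed by evaluating the surrogate on feature values outside the training range, as the abstract describes.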