Predicting complexation performance between cyclodextrins and guest molecules by integrated machine learning and molecular modeling techniques
Predicting complexation performance between cyclodextrins and guest molecules with integrated machine learning and molecular modeling techniques (2019)
Most pharmaceutical formulation development is complex, and ideal formulations are generally obtained only after extensive experimentation. Machine learning is increasingly advancing many aspects of modern society and has achieved significant success across multiple fields. The current research demonstrated that machine learning can be adopted to build highly accurate predictive models for drug/cyclodextrin (CD) systems. Molecular descriptors of the compounds and the experimental conditions served as inputs, while the complexation free energy served as output. Results showed that the light gradient boosting machine (LightGBM) provided significantly better predictive performance than random forest and deep learning: the mean absolute error was 1.38 kJ/mol and the squared correlation coefficient was 0.86. Evaluating the relative importance of the molecular descriptors further revealed the key factors affecting molecular interactions in drug/CD systems. In the specific ketoprofen/CD systems, the machine learning model showed better predictive performance than molecular modeling calculations, while molecular simulation could provide structural, dynamic, and energetic information. The integration of machine learning and molecular simulation can thus produce a synergistic effect for interpreting and predicting pharmaceutical formulations. In conclusion, the developed predictive models were able to quickly and accurately predict the solubilizing capacity of CD systems. This research takes an important step toward the application of machine learning in pharmaceutical formulation design.
KEY WORDS : Machine learning | Deep learning | LightGBM | Random forest | Cyclodextrin | Binding free energy | Molecular modeling | Ketoprofen
A machine learning classifier for microlensing in wide-field surveys
A machine learning classifier for microlensing in wide-field surveys (2019)
While microlensing is very rare, occurring on average once per million stars observed, current and near-future surveys are coming online with the capability of providing photometry of almost the entire visible sky to depths of R ∼ 22 mag or fainter every few days, which will contribute to the detection of black holes and exoplanets through follow-up observations of microlensing events. Based on galactic models, we can expect microlensing events across a vastly wider region of the galaxy, although the cadence of these surveys (2-3 d⁻¹) is lower than that of traditional microlensing surveys, making efficient detection a challenge. Rapid advances are being made in the use of time-series data to detect and classify transient events in real time with very high data-rate surveys, but little work has been published on the detection of microlensing events, particularly when the data streams are of relatively low cadence. In this research, we explore the utility of a Random Forest algorithm for identifying microlensing signals in time-series data, with the goal of creating an efficient machine learning classifier that can be applied to search for microlensing in wide-field surveys even with low-cadence data. We applied and optimized our classifier using the OGLE-II microlensing dataset, in addition to testing it on PTF/iPTF survey data and on the currently operating ZTF, which uses the same data-handling infrastructure envisioned for the upcoming LSST.
Keywords: Gravitational microlensing | Classification | Random forest | Machine learning | PTF | ZTF
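A minimal illustration of the approach the abstract describes: summary statistics extracted from sparsely sampled light curves feed a Random Forest that separates a microlensing-like brightening from a flat noisy baseline. The point-lens magnification formula is standard, but the features, noise levels, and event parameters below are simplified assumptions, not the paper's OGLE-II pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
t = np.linspace(-50, 50, 40)  # sparse time sampling, mimicking low cadence

def paczynski(t, t0=0.0, tE=15.0, u0=0.3):
    # Standard point-source point-lens magnification A(u)
    u = np.sqrt(u0**2 + ((t - t0) / tE) ** 2)
    return (u**2 + 2) / (u * np.sqrt(u**2 + 4))

def features(flux):
    # Simple per-light-curve summary statistics
    return [flux.max(), flux.std(), np.ptp(flux), np.median(flux)]

X, y = [], []
for _ in range(200):
    flat = 1.0 + rng.normal(scale=0.05, size=t.size)
    event = paczynski(t, t0=rng.uniform(-20, 20)) + rng.normal(scale=0.05, size=t.size)
    X += [features(flat), features(event)]
    y += [0, 1]

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"CV accuracy: {scores.mean():.2f}")
```

On this idealized toy problem the classes separate easily; real survey data add blending, variable stars, and irregular sampling, which is what makes the paper's classifier nontrivial.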
A flow-based approach for Trickbot banking trojan detection
A flow-based approach for detecting the TrickBot banking trojan (2019)
Nowadays, online banking is an attractive way of carrying out financial operations such as e-commerce, e-banking, and e-payments without much effort or the need for any physical presence. The increasing popularity of online banking services and payment systems has motivated financial attackers to steal customers' credentials and money. Banking trojans have been a means of attacking these financial institutions for more than a decade and have become one of the primary drivers of botnet traffic. However, the stealthy nature of financial botnets requires new techniques and novel systems for detection and analysis in order to prevent losses and ultimately take the botnets down. TrickBot, which specifically threatens businesses in the financial sector and their customers, has been behind man-in-the-browser attacks since 2016. Its main goal is to steal online banking information from victims when they visit their banking websites. In this study, we use machine learning techniques to detect TrickBot malware infections and to identify TrickBot-related traffic flows without having to analyze network packet payloads, IP addresses, port numbers, or protocol information. Since command-and-control server IPs are updated almost daily, identifying TrickBot-related traffic flows without looking at specific IP addresses is significant. We adopt behavior-based classification that uses artifacts created by the malware during dynamic analysis of TrickBot samples. We compare the performance of four state-of-the-art machine learning algorithms, Random Forest, Sequential Minimal Optimization, Multilayer Perceptron, and Logistic Model, in identifying TrickBot-related flows and detecting a TrickBot infection. We then optimize the proposed classifier by exploring the best hyperparameters and feature-set selection. Using flow-level features such as packet length, packet and flag counts, and inter-arrival times, the Random Forest classifier identifies TrickBot-related flows with 99.9534% accuracy and a 91.7% true positive rate.
Keywords: TrickBot | Banking trojan | Machine learning | Anomaly traffic detection | Dynamic analysis | Random forest
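The payload-agnostic flow classification the abstract describes can be sketched with a Random Forest over a few flow-level features (mean packet length, packets per flow, inter-arrival time). The feature distributions below are invented for illustration; in the study these come from dynamic analysis of real TrickBot samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

def flows(n, pkt_len, pkt_count, iat):
    # Synthetic flow records: no payloads, IPs, or ports are used
    return np.column_stack([
        rng.normal(pkt_len, 30, n),   # mean packet length (bytes)
        rng.poisson(pkt_count, n),    # packets per flow
        rng.exponential(iat, n),      # mean inter-arrival time (s)
    ])

benign = flows(1000, pkt_len=500, pkt_count=20, iat=0.5)
trickbot = flows(100, pkt_len=200, pkt_count=8, iat=2.0)  # beacon-like flows
X = np.vstack([benign, trickbot])
y = np.array([0] * 1000 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
tpr = recall_score(y_te, clf.predict(X_te))
print(f"Malicious-flow true positive rate: {tpr:.2f}")
```

The true positive rate on malicious flows is reported separately from accuracy because, as in the study's data, malicious flows are a small minority of traffic.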
On the application of machine learning techniques to derive seismic fragility curves
Using machine learning methods to derive seismic fragility curves (2019)
Deriving fragility curves is a key step in seismic risk assessment within the performance-based earthquake engineering framework. The objective of this study is to apply machine learning tools (classification-based tools in particular) to predict structural responses and fragility curves. Ten classification-based methods are explored: logistic regression, lasso regression, support vector machine, naïve Bayes, decision tree, random forest, linear and quadratic discriminant analysis, neural networks, and K-nearest neighbors, using structural responses obtained from multiple-stripe analyses. In addition, this study examines the impact of class imbalance in the training dataset, which is typical of structural-response data, when developing classification-based models for predicting structural responses. The statistical results on the implemented dataset demonstrate that, among the applied methods, random forest and quadratic discriminant analysis are preferable with the imbalanced and balanced datasets, respectively, since they show the highest efficiency in predicting the structural responses. Moreover, a detailed procedure is presented for deriving fragility curves from the classification-based tools. Finally, the sensitivity of the applied machine learning methods to the size of the employed dataset is investigated. The results show that logistic regression, lasso regression, and naïve Bayes are not sensitive to dataset size (i.e., the number of performed time-history analyses), while the performance of discriminant analysis depends significantly on the size of the applied dataset.
Keywords: Fragility curve | Machine learning tools | Imbalanced dataset | Random forest | Support vector machine | Multiple-stripe analysis
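The step of turning a classifier into a fragility curve can be illustrated with a minimal synthetic example: a logistic model fit on the log of a ground-motion intensity measure directly yields P(damage-state exceedance | IM), which is the fragility curve. All numbers below are invented stand-ins for multiple-stripe analysis results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
im = rng.uniform(0.05, 2.0, 400)  # intensity measure, e.g. Sa (g)
# Synthetic binary "exceedance" outcomes whose probability rises with log(IM)
p_true = 1 / (1 + np.exp(-(3.0 * np.log(im) + 1.0)))
y = rng.random(400) < p_true

model = LogisticRegression().fit(np.log(im).reshape(-1, 1), y)

# Fragility curve: predicted P(exceedance | IM) over a grid of intensities
grid = np.linspace(0.1, 2.0, 5).reshape(-1, 1)
frag = model.predict_proba(np.log(grid))[:, 1]
for sa, p in zip(grid.ravel(), frag):
    print(f"Sa = {sa:.2f} g -> P(exceedance) = {p:.2f}")
```

Any of the ten classifiers in the study that outputs class probabilities can be substituted for the logistic model here; the curve is then traced by evaluating the fitted classifier over a grid of intensity values.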
Prediction of the apple scab using machine learning and simple weather stations
Prediction of apple scab using machine learning and simple weather stations (2019)
Apple scab is an economically important disease in apple production. It is controlled by applying fungicides when conditions are ripe for the development of its spores, which occurs when leaves are wet for a long enough time at a given temperature. However, leaf wetness is not a sufficiently well-defined agro-meteorological variable; moreover, the readings of leaf wetness sensors depend to a large extent on their location within the tree canopy. Here we show that virtual wetness sensors, based on easily obtained meteorological parameters such as temperature, relative humidity, and wind speed, can be used in place of physical sensors. To this end, we first collected data for two growing seasons from two types of wetness sensors placed in four locations in the tree canopy. Then, for each sensor, we built a machine-learning model of leaf wetness using the aforementioned meteorological variables; these models were then used as virtual sensors. Finally, Mills models of apple scab infection were built using both real and virtual sensors, and their results were compared. The comparison of apple scab models based on real sensors shows significant variability; in particular, the results of a model depend on the location of the sensor within the canopy. The models based on data from virtual sensors are similar to the models based on physical sensors, and both types of models generate results within the same range of variability. The outcome of the study shows that control of apple scab can rely on machine learning models built from standard meteorological variables, which can be readily obtained with inexpensive meteorological stations equipped with basic sensors. These results open the way to widespread application of precise control of apple scab and consequently to a significant reduction in pesticide use in apple production, with benefits for the environment, human health, and the economics of production.
Keywords: Apple scab | Machine learning | Random forest
Predicting ground-level PM2.5 concentrations in the Beijing-Tianjin-Hebei region: A hybrid remote sensing and machine learning approach
Predicting ground-level PM2.5 concentrations in the Beijing-Tianjin-Hebei region: a hybrid remote sensing and machine learning approach (2019)
An accurate estimation of PM2.5 (fine particulate matter with aerodynamic diameter ≤ 2.5 μm) concentrations is critical for health risk assessment and for developing air pollution control strategies. In this study, a hybrid remote sensing and machine learning approach, named the RSRF model, is proposed to estimate daily ground-level PM2.5 concentrations; it integrates Random Forest (RF), a machine learning (ML) model, with aerosol optical depth (AOD), a remote sensing (RS) product. The proposed RSRF model makes it possible to adequately characterize real-time spatiotemporal PM2.5 distributions over uninhabited places and complex surfaces. It also offers advantages in handling complicated non-linear relationships among a large number of meteorological, environmental, and air pollutant factors, as well as ever-increasing environmental data sets. The applicability of the proposed RSRF model is tested in the Beijing-Tianjin-Hebei (BTH) region during 2015-2017. Deep Blue (DB) AOD from the Aqua-retrieved Collection 6.1 (C_61) aerosol products of the Moderate Resolution Imaging Spectroradiometer (MODIS) is validated against the Aerosol Robotic Network; the validation indicates that C_61 DB AOD correlates highly with ground-based AOD in the BTH region. The proposed RSRF model performed well in characterizing spatiotemporal variations of annual and seasonal PM2.5 concentrations. It is not only useful for quantifying the relationships between PM2.5 and relevant factors such as DB AOD and meteorological and air pollutant variables, but can also provide decision support for regional air pollution control during haze periods.
Keywords: Remote sensing | Aerosol optical depth | Machine learning | PM2.5 | Random forest
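The core of the RSRF idea, a Random Forest regressor mapping AOD plus meteorological covariates to ground-level PM2.5 and then ranking predictors by importance, can be sketched as below. All variables are synthetic placeholders, not MODIS or station data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 600
aod = rng.uniform(0.1, 1.5, n)    # satellite-retrieved AOD (unitless)
temp = rng.normal(15, 10, n)      # temperature (deg C)
rh = rng.uniform(20, 90, n)       # relative humidity (%)
wind = rng.exponential(3, n)      # wind speed (m/s)
# Synthetic PM2.5 (ug/m3): rises with AOD and humidity, falls with wind
pm25 = 60 * aod + 0.3 * rh - 4 * wind + rng.normal(0, 5, n)

X = np.column_stack([aod, temp, rh, wind])
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, pm25)
for name, imp in zip(["AOD", "temp", "RH", "wind"], rf.feature_importances_):
    print(f"{name}: importance {imp:.2f}")
```

The impurity-based importances correspond to the abstract's claim that the model can quantify the relationships between PM2.5 and its driving factors; on this synthetic data AOD dominates by construction.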
Direct marketing campaigns in retail banking with the use of deep learning and random forests
Direct marketing campaigns in retail banking using deep learning and random forests (2019)
Credit products are a crucial part of the business of banks and other financial institutions. A novel approach based on a time-series representation of customer data for predicting willingness to take a personal loan is presented. The proposed testing procedure, based on a moving window, allows the detection of complex, sequential, time-based dependencies between particular transactions. Moreover, this approach reduces noise by eliminating irrelevant dependencies that would otherwise arise from the lack of a time dimension in the analysis. A system for identifying customers interested in credit products, based on classification with random forests and deep neural networks, is proposed. The promising results of empirical studies show that the system is able to extract significant patterns from customers' historical transfer and transactional data and to predict the likelihood of a credit purchase. Our approach, including the testing method, is not limited to the banking sector and can easily be transferred and implemented as a general-purpose direct marketing campaign system.
Keywords: Consumer credit | Retail banking | Direct marketing | Marketing campaigns | Database marketing | Random forest | Deep learning | Deep belief networks | Data mining | Time series | Feature selection | Boruta algorithm
Comparison of machine learning classifiers for differentiation of grade 1 from higher gradings in meningioma: A multicenter radiomics study
Comparison of machine learning classifiers for differentiating grade 1 from higher gradings in meningioma: a multicenter radiomics study (2019)
Background and purpose: Advanced imaging analysis for predicting tumor biology and modelling clinically relevant parameters from computed imaging features is part of the emerging field of radiomics. Here we test the hypothesis that a machine learning approach can distinguish grade 1 from higher gradings in meningioma patients using radiomics features derived from a heterogeneous multicenter dataset of multi-parametric MRI. Methods: A total of 138 patients from 5 international centers who underwent MRI prior to surgical resection of intracranial meningiomas were included. Segmentation was performed manually on co-registered multi-parametric MR images using apparent diffusion coefficient (ADC) maps, T1-weighted (T1), post-contrast T1-weighted (T1c), subtraction maps (Sub, T1c - T1), T2-weighted fluid-attenuated inversion recovery (FLAIR), and T2-weighted (T2) images. After feature selection, and using cross-validation to separate training from testing data, four machine learning classifiers were scored on combinations of MRI modalities: random forest (RF), extreme gradient boosting (XGBoost), support vector machine (SVM), and multilayer perceptron (MLP). Results: The best AUC of 0.97 (sensitivity 1.0, specificity 0.97) was observed for the combination of ADC, ADC of the peritumoral edema, T1, T1c, Sub, and FLAIR-derived features, using only 16 of the 10,914 possible features and XGBoost. Conclusions: Machine learning using radiomics features derived from multi-parametric MRI achieves high AUC scores with high sensitivity and specificity in classifying meningiomas between low and higher gradings, despite heterogeneous protocols across different centers. Feature selection can be performed effectively even when a large amount of data is extracted for radiomics fingerprinting.
Keywords: Random forest | Support vector machine | Multilayer perceptron | XGBoost | Machine learning | Meningioma | Grading | Feature selection
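The overall pipeline shape of this study, aggressive univariate feature selection down to a handful of radiomics features followed by a gradient-boosted classifier scored by cross-validated AUC, can be illustrated as follows. The data are synthetic stand-ins for the 10,914 extracted MRI features, an ANOVA F-test substitutes for the study's feature selection, and scikit-learn's GradientBoostingClassifier substitutes for XGBoost.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# 200 synthetic "patients" with 100 radiomics-style features,
# of which only 15 carry class signal
X, y = make_classification(n_samples=200, n_features=100, n_informative=15,
                           random_state=0)
pipe = make_pipeline(
    SelectKBest(f_classif, k=16),        # keep 16 features, refit per CV fold
    GradientBoostingClassifier(random_state=0),
)
auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
print(f"Cross-validated AUC: {auc:.2f}")
```

Placing the selector inside the pipeline matters: it is refit on each training fold, so the selected features never see the test fold, which is the leakage-free protocol the abstract's cross-validation implies.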
Glioma stages prediction based on machine learning algorithm combined with protein-protein interaction networks
Glioma stage prediction based on a machine learning algorithm combined with protein-protein interaction networks (2019)
Background: Glioma is the most lethal nervous system cancer. Recent studies have made great efforts to study the occurrence and development of glioma, but the molecular mechanisms are still unclear. This study was designed to reveal the molecular mechanisms of glioma using protein-protein interaction (PPI) networks combined with machine learning methods. Key differentially expressed genes (DEGs) were screened and selected using the PPI networks. Results: As a result, 19 key genes were identified between grade I and grade II, 21 between grade II and grade III, and 20 between grade III and grade IV. Five machine learning methods were then employed to predict glioma stages based on the selected key genes. After comparison, a Complement Naïve Bayes classifier was employed to build the prediction model for grade II-III, with an accuracy of 72.8%, and Random Forest was employed to build the prediction models for grade I-II and grade III-IV, with accuracies of 97.1% and 83.2%, respectively. Finally, the selected genes were analyzed with PPI networks, Gene Ontology (GO) terms, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and the results improve our understanding of the biological functions of the selected DEGs involved in glioma growth. We expect that the key genes identified will have guiding significance for the occurrence of gliomas or, at the very least, be useful for tumor researchers. Conclusion: Machine learning combined with PPI networks and GO and KEGG analyses of selected DEGs improves our understanding of the biological functions involved in glioma growth.
Keywords: DEGs | Machine learning | PPI networks | GO | KEGG | SVM | ANN | Random forest | Complement naïve Bayes
Multi-parametric optic disc segmentation using superpixel based feature classification
Multi-parametric optic disc segmentation using superpixel-based feature classification (2019)
Glaucoma, along with diabetic retinopathy, is a major cause of blindness and is projected to affect over 80 million people by 2020. Recently, expert systems have matched human performance in disease diagnosis and have proven highly useful in assisting medical experts in the diagnosis and detection of diseases. Hence, automated optic disc detection through intelligent systems is vital for early diagnosis and detection of glaucoma. This paper presents a multi-parametric optic disc detection and localization method for retinal fundus images using region-based statistical and textural features. Highly discriminative features are selected based on the mutual information criterion, and a comparative analysis of four benchmark classifiers is presented: Support Vector Machine, Random Forest (RF), AdaBoost, and RusBoost. The results of the proposed RF-classifier-based pipeline demonstrate highly competitive performance with the state of the art (accuracies of 0.993, 0.988, and 0.993 on the DRIONS, MESSIDOR, and ONHSD databases), making it a suitable candidate for patient management systems for early diagnosis of glaucoma.
Keywords: AdaBoostM1 | Glaucoma | RusBoost | Random forest | Support vector machine