Deep Representation Learning for Individualized Treatment Effect Estimation using Electronic Health Records
Deep Representation Learning for Individualized Treatment Effect Estimation using Electronic Health Records-2019
Utilizing clinical observational data to estimate individualized treatment effects (ITE) is a challenging task, as confounding inevitably exists in clinical data. Most existing models for ITE estimation tackle this problem by learning balanced representations that yield unbiased estimators of the treatment effects. Although valuable, learning a balanced representation is sometimes directly opposed to the objective of learning an effective and discriminative model for ITE estimation. We propose a novel hybrid model bridging multi-task deep learning and K-nearest neighbors (KNN) for ITE estimation. In detail, the proposed model first adopts multi-task deep learning to extract both outcome-predictive and treatment-specific latent representations from Electronic Health Records (EHR), by jointly performing outcome prediction and treatment category classification. Thereafter, we estimate counterfactual outcomes with KNN based on the learned hidden representations. We validate the proposed model on a widely used semi-simulated dataset, i.e., IHDP, and on a real-world clinical dataset consisting of 736 heart failure (HF) patients. The performance of our model remains robust, reaching 1.7 and 0.23 in terms of Precision in Estimation of Heterogeneous Effect (PEHE) and error in the average treatment effect (ATE), respectively, on the IHDP dataset, and 0.703 and 0.796 in terms of accuracy and F1 score, respectively, on the HF dataset. The results demonstrate that the proposed model achieves competitive performance relative to state-of-the-art models. In addition, the results reveal several findings consistent with existing medical domain knowledge, and suggest hypotheses that could be validated through further investigation in the clinical domain.
Keywords: Individualized Treatment Effect Estimation | Counterfactual Inference | Deep Representation Learning | Multi-task Learning | K-Nearest Neighbors
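The abstract's KNN step can be sketched in a few lines: for each patient, the counterfactual outcome is approximated by averaging the observed outcomes of the k nearest neighbors drawn from the opposite treatment group in the learned representation space. A minimal pure-Python illustration (the function names, toy representation vectors, and k=3 are assumptions for the sketch, not the paper's code):

```python
import math

def knn_counterfactual(reps, treatments, outcomes, i, k=3):
    """Estimate unit i's counterfactual outcome as the mean observed outcome
    of its k nearest neighbours in the OPPOSITE treatment group, with
    distances measured in the learned representation space."""
    opposite = [j for j in range(len(reps)) if treatments[j] != treatments[i]]
    nearest = sorted(opposite, key=lambda j: math.dist(reps[j], reps[i]))[:k]
    return sum(outcomes[j] for j in nearest) / len(nearest)

def ite_estimate(reps, treatments, outcomes, i, k=3):
    """ITE = (outcome under treatment) - (outcome under control) for unit i;
    one of the two terms is observed, the other comes from the KNN step."""
    cf = knn_counterfactual(reps, treatments, outcomes, i, k)
    y = outcomes[i]
    return (y - cf) if treatments[i] == 1 else (cf - y)
```

With deep-learned representations, nearby points in this space are expected to be clinically similar, which is what makes the neighbour average a plausible counterfactual.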
A machine learning algorithm for high throughput identification of FTIR spectra: Application on microplastics collected in the Mediterranean Sea
A machine learning algorithm for high throughput identification of FTIR spectra: Application on microplastics collected in the Mediterranean Sea-2019
The development of methods to automatically determine the chemical nature of microplastics from FTIR-ATR spectra is an important challenge. A machine learning method, k-nearest neighbors classification, has been applied to spectra of microplastics collected during the Tara Expedition in the Mediterranean Sea (2014). To perform these tests, a learning database composed of 969 microplastic spectra was created. Results show that the machine learning process is very efficient at identifying spectra of classical polymers such as poly(ethylene), but also that the learning database must be enhanced with less common microplastic spectra. Finally, this method was applied to more than 4000 spectra of unidentified microplastics. The verification protocol showed less than 10% difference between the results of the proposed automated method and those of a human expert, 75% of which could be corrected very easily.
Keywords: Microplastic | Tara mediterranean campaign | FTIR spectra | Machine learning | k-nearest neighbor classification
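As an illustration of how a spectral library lookup of this kind can work, the sketch below scores a query spectrum against reference spectra by cosine similarity and takes a majority vote among the k best matches. The toy absorbance vectors and polymer labels are invented for the example; the paper's actual preprocessing and similarity measure may differ:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two absorbance vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def identify_spectrum(query, library, k=3):
    """library: list of (absorbance_vector, polymer_label) reference spectra.
    Returns the majority polymer label among the k most similar entries."""
    ranked = sorted(library, key=lambda entry: cosine(query, entry[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

In practice the reference vectors would be full FTIR-ATR absorbance profiles over a shared wavenumber grid rather than three-component toys.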
Analysis of operating system identification via fingerprinting and machine learning
Analysis of operating system identification via fingerprinting and machine learning-2019
In operating system (OS) fingerprinting, the OS is identified using network packets and a rule-based matching method. However, this matching method has problems when the network packet information is insufficient or the OS is relatively new. This study compares the OS identification capabilities of several machine learning methods, specifically K-nearest neighbors (K-NN), Decision Tree, and Artificial Neural Network (ANN), to that of a conventional commercial rule-based method. It is shown that the ANN correctly identifies operating systems with 94% probability, which is higher than the accuracy of the conventional rule-based method.
Keywords: Operating system fingerprinting | Machine learning | Artificial Neural Network | NetworkMiner | K-nearest Neighbors | Decision Tree
On the application of machine learning techniques to derive seismic fragility curves
On the application of machine learning techniques to derive seismic fragility curves-2019
Deriving fragility curves is a key step in seismic risk assessment within the performance-based earthquake engineering framework. The objective of this study is to implement machine learning tools (classification-based tools in particular) for predicting structural responses and fragility curves. In this regard, ten different classification-based methods are explored: logistic regression, lasso regression, support vector machine, Naïve Bayes, decision tree, random forest, linear and quadratic discriminant analyses, neural networks, and K-nearest neighbors, using the structural responses resulting from the multiple strip analyses. In addition, this study examines the impact of class imbalance in the training dataset, which is typical of structural response data, when developing classification-based models for predicting structural responses. The statistical results on the implemented dataset demonstrate that, among the applied methods, random forest and quadratic discriminant analysis are preferable with the imbalanced and balanced datasets, respectively, since they show the highest efficiency in predicting the structural responses. Moreover, a detailed procedure is presented for deriving fragility curves based on the classification-based tools. Finally, the sensitivity of the applied machine learning methods to the size of the employed dataset is investigated. The results show that logistic regression, lasso regression, and Naïve Bayes are not sensitive to the size of the dataset (i.e., the number of performed time history analyses), while the performance of discriminant analysis depends significantly on the size of the applied dataset.
Keywords: Fragility curve | Machine learning tools | Imbalanced dataset | Random forest | Support vector machine | Multiple strip analysis
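A common way to obtain a fragility curve from a classifier, consistent with the logistic regression option listed above, is to model the probability of exceeding a damage state as a function of the log intensity measure (IM). The sketch below fits sigmoid(a + b·ln IM) by gradient ascent on toy analysis outcomes; the data and hyperparameters are illustrative only, not the study's procedure:

```python
import math

def fit_fragility(ims, exceeded, lr=0.1, steps=5000):
    """Fit P(exceedance | IM) = sigmoid(a + b*ln(IM)) by gradient ascent on
    the log-likelihood. ims: intensity measures; exceeded: 0/1 outcomes
    from structural analyses. (a, b) define the fragility curve."""
    a, b = 0.0, 0.0
    xs = [math.log(im) for im in ims]
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, exceeded):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += y - p            # gradient w.r.t. intercept
            gb += (y - p) * x      # gradient w.r.t. slope
        a += lr * ga / len(xs)
        b += lr * gb / len(xs)
    return a, b

def fragility(im, a, b):
    """Probability of exceeding the damage state at intensity measure im."""
    return 1.0 / (1.0 + math.exp(-(a + b * math.log(im))))
```

Evaluating `fragility` over a grid of IM values traces out the curve; the other nine classifiers listed in the abstract would supply the exceedance probabilities in the same role.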
Automating orthogonal defect classification using machine learning algorithms
Automating orthogonal defect classification using machine learning algorithms-2019
Software systems are increasingly being used in business- or mission-critical scenarios, where the presence of certain types of software defects, i.e., bugs, may result in catastrophic consequences (e.g., financial losses or even the loss of human lives). To deploy systems on which we can rely, it is vital to understand the types of defects that tend to affect such systems. This allows developers to take proper action, such as adapting the development process or redirecting testing efforts (e.g., using a certain set of testing techniques, or focusing on certain parts of the system). Orthogonal Defect Classification (ODC) has emerged as a popular method for classifying software defects, but it requires one or more experts to categorize each defect in a quite complex and time-consuming process. In this paper, we evaluate the use of machine learning algorithms (k-Nearest Neighbors, Support Vector Machines, Naïve Bayes, Nearest Centroid, Random Forest, and Recurrent Neural Networks) for the automatic classification of software defects using ODC, based on unstructured textual bug reports. Experimental results reveal the difficulties in automatically classifying certain ODC attributes using reports alone, but also suggest that the overall classification accuracy may be improved in most cases if larger datasets are used.
Index Terms : Software Defects | Bug Reports | Orthogonal Defect Classification | Machine Learning | Text Classification
Failure detection in robotic arms using statistical modeling, machine learning and hybrid gradient boosting
Failure detection in robotic arms using statistical modeling, machine learning and hybrid gradient boosting-2019
Modeling and failure prediction are important tasks in many engineering systems. For these tasks, the machine learning literature presents a large variety of models, such as classification trees, random forests, and artificial neural networks. Standard statistical models, such as logistic regression, linear discriminant analysis, and k-nearest neighbors, can also be applied. This work evaluates the advantages and limitations of statistical and machine learning methods for predicting failures in industrial robots. The work is based on data from more than five thousand robots in industrial use. Furthermore, a new approach combining standard statistical and machine learning models, named hybrid gradient boosting, is proposed. Results show that hybrid gradient boosting achieves significant improvement compared to statistical and machine learning methods. Furthermore, local joint information has been identified as the main driver for failure detection, whereas failure classification can be improved using additional information from different joints and hybrid models.
Keywords: Statistical modeling | Machine learning | Gradient boosting
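The abstract does not spell out the formulation of hybrid gradient boosting, but one plausible reading of "combining standard statistical and machine learning models" can be sketched as a two-stage fit: an ordinary least-squares line (the statistical part) followed by gradient boosting with regression stumps on its residuals. All names, data, and hyperparameters here are illustrative assumptions, not the authors' method:

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump on a 1-D feature."""
    best = None
    for split in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    return best[1], best[2], best[3]

def hybrid_boost(xs, ys, rounds=50, lr=0.5):
    """Stage 1: least-squares line. Stage 2: boost stumps on its residuals."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    pred = [a + b * x for x in xs]
    stumps = []
    for _ in range(rounds):
        res = [y - p for y, p in zip(ys, pred)]
        split, lmean, rmean = fit_stump(xs, res)
        stumps.append((split, lr * lmean, lr * rmean))
        pred = [p + (lr * lmean if x <= split else lr * rmean)
                for p, x in zip(pred, xs)]
    return a, b, stumps

def predict(x, a, b, stumps):
    y = a + b * x
    for split, lv, rv in stumps:
        y += lv if x <= split else rv
    return y
```

The boosted corrections capture structure (e.g., a jump) that the linear base model misses, which is the kind of gain the abstract reports over either model family alone.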
Machine learning powered software for accurate prediction of biogas production: A case study on industrial-scale Chinese production data
Machine learning powered software for accurate prediction of biogas production: A case study on industrial-scale Chinese production data-2019
The search for appropriate models for predictive analytics is currently a high priority to optimize anaerobic fermentation processes in industrial-scale biogas facilities; operational productivity could be enhanced if project operators used the latest tools in machine learning to inform decision-making. The objective of this study is to enhance biogas production in industrial facilities by designing a graphical user interface for machine learning models capable of predicting biogas output given a set of waste inputs. The methodology involved applying predictive algorithms to daily production data from two major Chinese biogas facilities in order to understand the most important inputs affecting biogas production. The machine learning models used included logistic regression, support vector machine, random forest, extreme gradient boosting, and k-nearest neighbors regression. The models were tuned and cross-validated for optimal accuracy. Our results showed that: (1) the KNN model had the highest accuracy for the Hainan biogas facility, with 87% accuracy on the test set; (2) municipal fecal residue, kitchen food waste, percolate, and chicken litter were the inputs that maximized biogas production; (3) an online web tool based on the machine learning models was developed to enhance the analytical capabilities of biogas project operators; (4) an online waste resource mapping tool was also developed for macro-level project location planning. This research has wide implications for biogas project operators seeking to enhance facility performance by incorporating machine learning into the analytical pipeline.
Keywords: Biogas | Machine learning | China | Graphical user interface
Machine Learning to Differentiate T2-Weighted Hyperintense Uterine Leiomyomas from Uterine Sarcomas by Utilizing Multiparametric Magnetic Resonance Quantitative Imaging Features
Machine Learning to Differentiate T2-Weighted Hyperintense Uterine Leiomyomas from Uterine Sarcomas by Utilizing Multiparametric Magnetic Resonance Quantitative Imaging Features-2019
Rationale and Objective: Uterine leiomyomas with high signal intensity on T2-weighted imaging (T2WI) can be difficult to distinguish from sarcomas. This study assessed the feasibility of using machine learning to differentiate uterine sarcomas from leiomyomas with high signal intensity on T2WI on multiparametric magnetic resonance imaging. Materials and Methods: This retrospective study included 80 patients (50 with benign leiomyoma and 30 with uterine sarcoma) who underwent pelvic 3 T magnetic resonance imaging examination for the evaluation of uterine myometrial smooth muscle masses with high signal intensity on T2WI. We used six machine learning techniques to develop prediction models based on 12 texture parameters on T1WI and T2WI, apparent diffusion coefficient maps, and contrast-enhanced T1WI, as well as tumor size and age. We calculated the areas under the curve (AUCs) using receiver-operating characteristic analysis for each model by 10-fold cross-validation and compared these to those for two board-certified radiologists. Results: The eXtreme Gradient Boosting model gave the highest AUC (0.93), followed by the random forest, support vector machine, multilayer perceptron, k-nearest neighbors, and logistic regression models. Age was the most important factor for differentiation (leiomyoma 44.9 ± 11.1 years; sarcoma 58.9 ± 14.7 years; p < 0.001). The AUC for the eXtreme Gradient Boosting was significantly higher than those for both radiologists (0.93 vs 0.80 and 0.68, p = 0.03 and p < 0.001, respectively). Conclusion: Machine learning outperformed experienced radiologists in the differentiation of uterine sarcomas from leiomyomas with high signal intensity on T2WI.
Key Words: Magnetic resonance imaging | Uterine neoplasm | Leiomyoma | Machine learning | Sarcoma
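The AUC values compared above have a simple rank-based definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counting one half. A minimal implementation of that definition (function name and toy scores are illustrative):

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative (ties count 1/2) -- the quantity compared across models
    and radiologists in ROC analysis."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A score of 1.0 means perfect separation of the two classes and 0.5 means chance-level ranking, which is why 0.93 vs 0.80 is a substantial gap.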
Batch-based active learning: Application to social media data for crisis management
Batch-based active learning: Application to social media data for crisis management-2018
Classification of evolving data streams is a challenging task, which is suitably tackled with online learning approaches. Data is processed instantly, requiring the learning machinery to (self-)adapt by adjusting its model. However, for high-velocity streams, it is usually difficult to obtain labeled samples to train the classification model. Hence, we propose a novel online batch-based active learning algorithm (OBAL) to perform the labeling. OBAL is developed for crisis management applications where data streams are generated by the social media community. OBAL is applied to discriminate relevant from irrelevant social media items. An emergency management user is interactively queried to label chosen items. OBAL exploits the boundary items for which it is highly uncertain about their class and makes use of two classifiers: k-Nearest Neighbors (kNN) and Support Vector Machine (SVM). OBAL is equipped with a labeling budget and a set of uncertainty strategies to identify the items for labeling. An extensive analysis is carried out to show OBAL's performance, the sensitivity of its parameters, and the contribution of the individual uncertainty strategies. Two types of datasets are used: synthetic datasets and social media datasets related to crises. The empirical results illustrate that OBAL has very good discrimination power.
Keywords: Online learning | Active learning | Classification | Social media | Crisis management
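One of OBAL's ingredients, querying the items the model is least certain about, can be sketched with a kNN vote margin: items whose k nearest labeled neighbors disagree most are selected for labeling, up to the budget. This sketch covers only the kNN side (OBAL also combines an SVM and several uncertainty strategies); all names and data are illustrative:

```python
import math
from collections import Counter

def knn_vote(x, labelled, k=3):
    """Return (predicted label, vote margin in [0, 1]) for item x,
    using its k nearest neighbours among labelled (vector, label) pairs.
    A small margin means the neighbourhood is split, i.e. high uncertainty."""
    ranked = sorted(labelled, key=lambda it: math.dist(x, it[0]))[:k]
    top = Counter(lab for _, lab in ranked).most_common()
    margin = (top[0][1] - (top[1][1] if len(top) > 1 else 0)) / k
    return top[0][0], margin

def select_batch(unlabelled, labelled, budget=2, k=3):
    """Query the items the kNN committee is least sure about (smallest
    vote margin), up to the labelling budget. Returns their indices."""
    scored = [(knn_vote(x, labelled, k)[1], i) for i, x in enumerate(unlabelled)]
    scored.sort()
    return [i for _, i in scored[:budget]]
```

The selected items would then be shown to the emergency management user, whose labels are fed back to update both classifiers.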
CHI-PG: A fast prototype generation algorithm for Big Data classification problems
CHI-PG: A fast prototype generation algorithm for Big Data classification problems-2018
The growing amount of available data has become a serious challenge for data mining and machine learning techniques. Well-known classification methods that have been widely applied so far are no longer feasible in Big Data environments. For this reason, prototype reduction techniques (both selection and generation) have emerged as a candidate solution to build a reduced version of the dataset that speeds up the execution of algorithms such as k-Nearest Neighbors and overcomes their memory constraints. However, these solutions generally have a quadratic O(N²) time complexity and share limitations similar to those encountered in data mining and machine learning algorithms in terms of time and memory requirements. To overcome these limitations, we introduce a new distributed MapReduce prototype generation method called CHI-PG that provides linear O(N) time complexity and ensures constant accuracy regardless of the degree of parallelism. This approach builds prototypes by applying a simple scheme based on the rule generation process of the Chi et al. Fuzzy Rule-Based Classification System and takes advantage of the suitability of this classifier for the MapReduce paradigm. The empirical study shows that our new approach significantly improves the execution time of a state-of-the-art distributed prototype reduction algorithm (MRPR) without decreasing (and even improving) classification accuracy and reduction rates. Moreover, CHI-PG has been shown to be a candidate solution to the time and memory constraints of k-Nearest Neighbors when tackling large-scale datasets.
Keywords: Prototype reduction | Prototype generation | Big Data | MapReduce | Fuzzy Rule-Based Classification Systems
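The linear-time idea behind prototype generation of this kind can be illustrated with a crisp grid standing in for the fuzzy partition: a single pass assigns each example to a (cell, class) bucket and emits the bucket centroid as a prototype, so k-Nearest Neighbors can then search the prototypes instead of the full dataset. The grid partition here is a deliberate simplification of CHI-PG's fuzzy rule base, and all names are illustrative:

```python
from collections import defaultdict

def generate_prototypes(points, labels, bins=3, lo=0.0, hi=1.0):
    """One O(N) pass over the data: map each point to a grid cell (a crisp
    stand-in for the fuzzy partition) and emit one prototype per
    (cell, class) pair -- the centroid of that bucket's members."""
    cells = defaultdict(lambda: [None, 0])  # (cell, label) -> [sum_vec, count]
    width = (hi - lo) / bins
    for p, lab in zip(points, labels):
        cell = tuple(min(int((x - lo) / width), bins - 1) for x in p)
        acc = cells[(cell, lab)]
        acc[0] = p[:] if acc[0] is None else [a + b for a, b in zip(acc[0], p)]
        acc[1] += 1
    return [([s / n for s in vec], lab)
            for (cell, lab), (vec, n) in cells.items()]
```

Because each example is touched once and the number of prototypes is bounded by cells × classes, the reduced set keeps kNN's memory and runtime costs manageable; the map/reduce split would shard the single pass across workers.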