DQPFS: Distributed quadratic programming based feature selection for big data
DQPFS: Quadratic programming-based feature selection for big data (2020)
With the advent of big data, the scalability of machine learning algorithms has become more crucial than ever. Feature selection, an essential preprocessing technique, can improve the performance of learning algorithms on large-scale datasets by removing irrelevant and redundant features. Because they lack scalability, most classical feature selection algorithms are poorly suited to the voluminous data of the big data era. QPFS is a traditional feature weighting algorithm that has been used in many feature selection applications. Inspired by classical QPFS, this paper proposes a scalable algorithm called DQPFS, based on the Apache Spark cluster computing model. The experimental study is performed on three big datasets that have both a large number of instances and a large number of features. Assessment criteria such as accuracy, execution time, speed-up, and scale-out are then computed. In addition, the results of the proposed algorithm are compared with those of classical QPFS and of DiRelief, a recently proposed distributed feature selection algorithm. The empirical results show that the proposed method has (a) better scale-out than DiRelief, (b) significantly lower execution time than DiRelief, (c) lower execution time than QPFS, and (d) better Naïve Bayes classification accuracy than DiRelief on two of the three datasets.
Keywords: Big data | Apache Spark | Feature selection | Feature ranking | Quadratic programming
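As a rough illustration of the classical QPFS idea that DQPFS builds on, the sketch below solves a small quadratic program over feature weights with projected gradient descent. It assumes the usual QPFS formulation: minimize (1 - alpha)/2 * x'Qx - alpha * f'x subject to x >= 0 and sum(x) = 1, where Q is a feature-similarity matrix and f a relevance vector. The toy values of Q, f, and alpha, the function names, and the choice of solver are assumptions made for this example; nothing here reflects the distributed Spark implementation described in the abstract.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:
            theta = candidate  # keep the threshold of the last valid index
    return [max(v_i - theta, 0.0) for v_i in v]


def qpfs_weights(Q, f, alpha=0.5, step=1.0, iterations=2000):
    """Feature weights minimizing (1 - alpha)/2 * x'Qx - alpha * f'x on the simplex."""
    n = len(f)
    x = [1.0 / n] * n  # start from uniform weights
    for _ in range(iterations):
        gradient = [
            (1.0 - alpha) * sum(Q[i][j] * x[j] for j in range(n)) - alpha * f[i]
            for i in range(n)
        ]
        x = project_to_simplex([x[i] - step * gradient[i] for i in range(n)])
    return x


# Toy data (assumed): features 0 and 1 are highly similar to each other,
# feature 2 is unrelated to both but only weakly relevant to the class.
Q = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
f = [0.8, 0.7, 0.1]

weights = qpfs_weights(Q, f, alpha=0.5)
ranking = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
print(weights, ranking)
```

In this toy case the redundant feature 1 ends up ranked below the weakly relevant but non-redundant feature 2, which is the trade-off the alpha parameter controls.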
Use of support vector machines with a parallel local search algorithm for data classification and feature selection
Use of support vector machines with a parallel local search algorithm for data classification and feature selection (2020)
Over the last decade, the number of studies on machine learning has significantly increased. One of the most widely researched areas of machine learning is data classification. Most big data systems require a large amount of information storage for analytic purposes; however, this involves some disadvantages, such as the costs of processing and collecting data. Thus, many researchers and practitioners are working on effectively reducing the number of features used in classification. This paper proposes a method which jointly optimizes both feature selection and classification. A survey of the relevant literature shows that the vast majority of studies focus on either feature selection or classification. In this study, the proposed parallel local search algorithm both selects features and finds a classifier with high rates of accuracy. Moreover, the proposed method is capable of finding solutions for problems that have extremely high numbers of features within a reasonable computation time.
Keywords: Support vector machines | Feature selection | Classification | Heuristic | Machine learning
Designing a general type-2 fuzzy expert system for diagnosis of depression
Designing a type-2 fuzzy expert system for diagnosis of depression (2019)
Depression is a common and important mental disorder that affects the quality of human life. Because people with depression are often unaware of their disorder and sometimes suffer from physical symptoms such as chronic pain, they tend to consult a physician instead of a psychologist. The physician’s diagnosis is therefore not always correct: the mental disorder may be mislabeled as a physical disease. Delay in diagnosing depression may have irreversible outcomes such as suicide. The most challenging aspect of depression diagnosis is thus to limit time loss while preserving accuracy. In this paper, a novel general type-2 fuzzy expert system for depression diagnosis is developed with two main objectives: the accuracy of the system and the diagnosis time. The proposed system may serve as a helpful guideline for physicians to refer patients to a psychologist by asking them 15 questions. The proposed general type-2 expert system has five steps. In the first step, general type-2 membership functions are generated using the zSlices method and the interval agreement approach (IAA). In the second step, fuzzy rules are extracted from data gathered from a hospital, using a brief extension of the Mendel method. In the third step, approximate reasoning is applied. In the fourth step, a multi-objective problem is solved with the MOEA/D method to minimize time and maximize accuracy; to minimize time, feature selection is applied using a briefly extended version of the MIFS (Mutual Information Feature Selection) method. In the final step, an appropriate solution is chosen from the resulting Pareto front (PF). The proposed general type-2 expert system has been tested and evaluated to demonstrate its performance. This intelligent system is able to diagnose depression accurately within a suitable time.
Keywords: Depression | Computing with words (CWW) | General type-2 fuzzy sets | zSlices | MOEA/D algorithm | Feature selection | Beck Depression Inventory-II test (BDI-II) | Adaptive system | Expert system
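The MIFS step mentioned in the abstract can be illustrated with a minimal sketch of the standard greedy MIFS criterion, which scores each candidate feature by its mutual information with the class minus beta times its summed mutual information with the features already selected. This is not the authors' extended variant, which the abstract does not detail. The question names (q1, q1_copy, q2), the toy answers, the severity labels, and beta = 1.0 are all hypothetical values chosen for the example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x, count_y, count_xy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )


def mifs(features, labels, k, beta=1.0):
    """Greedy MIFS: add the feature with the best relevance-minus-redundancy
    score until k features have been selected."""
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(name):
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            )
            return relevance[name] - beta * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical questionnaire answers: q1_copy duplicates q1, q2 is independent of q1.
answers = {
    "q1":      [0, 0, 1, 1, 0, 0, 1, 1],
    "q1_copy": [0, 0, 1, 1, 0, 0, 1, 1],
    "q2":      [0, 1, 0, 1, 0, 1, 0, 1],
}
severity = [0, 1, 2, 3, 0, 1, 2, 3]  # hypothetical class labels

chosen = mifs(answers, severity, k=2, beta=1.0)
print(chosen)
```

Because the duplicated question adds no new information once q1 is selected, the redundancy penalty removes it from contention, which is how this kind of selection can shorten a questionnaire without discarding informative items.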
Texture descriptors and voxels for the early diagnosis of Alzheimer’s disease
Texture descriptors and voxels for the early diagnosis of Alzheimer’s disease (2019)
Background and objective: Early and accurate diagnosis of Alzheimer’s disease (AD) is critical, since early treatment effectively slows the progression of the disease, thereby adding productive years to the lives of those afflicted. A major problem encountered in the classification of MRI for the automatic diagnosis of AD is the so-called curse of dimensionality, which is a consequence of the high dimensionality of MRI feature vectors and the low number of training patterns available in most MRI datasets relevant to AD. Methods: A method for early diagnosis of AD is proposed that combines a set of SVMs trained on different texture descriptors (which reduce dimensionality) extracted from slices of magnetic resonance images (MRIs) with a set of SVMs trained on markers built from the voxels of MRIs. The dimension of the voxel-based features is reduced by using different feature selection algorithms, each of which trains a separate SVM. These two sets of SVMs are then combined by the weighted-sum rule for a final decision. Results: Experimental results show that 2D texture descriptors improve the performance of state-of-the-art voxel-based methods. The evaluation of our system on the four ADNI datasets demonstrates the efficacy of the proposed ensemble and its contribution to the accurate prediction of AD. Conclusions: Ensembles of texture descriptors combine partially uncorrelated information with respect to standard approaches based on voxels, feature selection, and classification by SVM. In other words, the fusion of a system based on voxels and an ensemble of texture descriptors enhances the performance of voxel-based approaches.
Keywords: Alzheimer’s disease | Ensemble of classifiers | Pattern recognition | Feature selection
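The weighted-sum rule used to fuse the two sets of SVMs can be sketched as follows. The example assumes each classifier outputs a signed decision score per subject (positive meaning AD), that the scores within each set are first averaged, and that the two set-level scores are then mixed with a single weight. The score values, the weight of 0.5, and the function names are assumptions for illustration; the abstract does not give the weights actually used.

```python
def average_scores(score_lists):
    """Average per-subject scores across the classifiers of one ensemble."""
    n_classifiers = len(score_lists)
    return [sum(column) / n_classifiers for column in zip(*score_lists)]


def weighted_sum_fusion(texture_scores, voxel_scores, texture_weight=0.5):
    """Weighted-sum rule: w * texture score + (1 - w) * voxel score per subject."""
    if len(texture_scores) != len(voxel_scores):
        raise ValueError("both ensembles must score the same subjects")
    return [
        texture_weight * t + (1.0 - texture_weight) * v
        for t, v in zip(texture_scores, voxel_scores)
    ]


# Assumed decision scores for two subjects from each ensemble.
texture_ensemble = [[0.8, -0.2], [0.6, -0.4]]  # two texture-descriptor SVMs
voxel_ensemble = [[-0.1, -0.5]]                # one voxel-based SVM

fused = weighted_sum_fusion(
    average_scores(texture_ensemble),
    average_scores(voxel_ensemble),
    texture_weight=0.5,
)
predictions = [1 if score > 0 else 0 for score in fused]  # 1 = AD, 0 = control
print(fused, predictions)
```

In practice the scores from different classifiers would typically be normalized to a common range before fusion, and the weight would be tuned on validation data.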
Machine learning-based coronary artery disease diagnosis: A comprehensive review
Machine learning-based coronary artery disease diagnosis: A comprehensive review (2019)
Coronary artery disease (CAD) is the most common cardiovascular disease (CVD) and often leads to a heart attack. It annually causes millions of deaths and billions of dollars in financial losses worldwide. Angiography, which is invasive and risky, is the standard procedure for diagnosing CAD. Alternatively, machine learning (ML) techniques have been widely used in the literature as fast, affordable, and noninvasive approaches for CAD detection. The results that have been published on ML-based CAD diagnosis differ substantially in terms of the analyzed datasets, sample sizes, features, location of data collection, performance metrics, and applied ML techniques. Due to these fundamental differences, achievements in the literature cannot be generalized. This paper conducts a comprehensive and multifaceted review of all relevant studies that were published between 1992 and 2019 for ML-based CAD diagnosis. The impacts of various factors, such as dataset characteristics (geographical location, sample size, features, and the stenosis of each coronary artery) and applied ML techniques (feature selection, performance metrics, and method) are investigated in detail. Finally, the important challenges and shortcomings of ML-based CAD diagnosis are discussed.
Keywords: CAD diagnosis | Machine learning | Data mining | Feature selection
A new machine learning technique for an accurate diagnosis of coronary artery disease
A new machine learning method for accurate diagnosis of coronary artery disease (2019)
Background and objective: Coronary artery disease (CAD) is one of the most common diseases around the world. An early and accurate diagnosis of CAD allows timely administration of appropriate treatment and helps to reduce mortality. Herein, we describe an innovative machine learning methodology that enables accurate detection of CAD and apply it to data collected from Iranian patients. Methods: We first tested ten traditional machine learning algorithms, and then the three best-performing algorithms (three types of SVM) were used in the rest of the study. To improve the performance of these algorithms, data preprocessing with normalization was carried out. Moreover, a genetic algorithm and particle swarm optimization, coupled with stratified 10-fold cross-validation, were used twice: for optimization of classifier parameters and for parallel selection of features. Results: The presented approach enhanced the performance of all traditional machine learning algorithms used in this study. We also introduced a new optimization technique called N2Genetic optimizer (a new genetic training). Our experiments demonstrated that N2Genetic-nuSVM provided an accuracy of 93.08% and an F1-score of 91.51% when predicting CAD outcomes among the patients included in the well-known Z-Alizadeh Sani dataset. These results are competitive and comparable to the best results in the field. Conclusions: We showed that machine learning techniques optimized by the proposed approach can lead to highly accurate models intended for both clinical and research use.
Keywords: Coronary artery disease (CAD) | Machine learning | Normalization | Genetic algorithm | Particle swarm optimization | Feature selection | Classification
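The stratified 10-fold cross-validation mentioned in the abstract can be sketched in a few lines. The idea is to deal the indices of each class across the folds in turn, so that every fold keeps roughly the class proportions of the full dataset. The labels, class counts, seed, and function names below are assumptions made for the example and are not taken from the Z-Alizadeh Sani dataset or from the authors' code.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    offset = 0  # continue dealing where the previous class stopped
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[(offset + position) % n_folds].append(index)
        offset += len(indices)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Assumed toy labels: 20 patients with CAD and 10 without.
labels = ["CAD"] * 20 + ["Normal"] * 10
folds = stratified_folds(labels, n_folds=10, seed=0)
splits = list(train_test_splits(folds))
print([sorted(labels[i] for i in fold) for fold in folds])
```

When a search procedure such as a genetic algorithm evaluates candidate parameters or feature subsets, each candidate would be scored by averaging a classifier's performance over these splits.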
Machine learning methods for MRI biomarkers analysis of pediatric posterior fossa tumors
Machine learning methods for MRI biomarker analysis of pediatric fossa tumors (2019)
Medical imaging technologies provide an increasing number of opportunities for disease prediction and prognosis. Specifically, imaging biomarkers can quantify the entire tumor phenotype to enhance prediction. Machine learning technology can be used to mine and analyze these biomarkers and to establish predictive models for clinical applications. Several studies have applied various machine learning methods to clinical predictions of different diseases based on imaging biomarkers. Here we seek to evaluate different machine learning methods in pediatric posterior fossa tumor prediction. We present a machine learning based framework for analyzing magnetic resonance imaging biomarkers for two kinds of pediatric posterior fossa tumors. In detail, three feature extraction methods are used to obtain 300 imaging biomarkers. Ten feature selection methods and 11 classifiers are evaluated by their quantified predictive performance and stability; the importance consistency of the features and the influence of the experimental factors are also analyzed. Our results demonstrate that the CFS feature selection method (accuracy: 83.85 ± 5.51%, stability: [0.84, 0.06]) and the SVM classifier (accuracy: 85.38 ± 3.47%, RSD: 4.77%) perform relatively better than the others and should be preferred. Among all the biomarkers, 17 texture features appear to be more important. Multifactor analysis results indicate that the choice of classifier contributes most to the variability in performance (37.25%). The machine learning based framework is efficient for biomarker analysis of pediatric posterior fossa tumors and could provide valuable references and decision support for assisted clinical diagnosis.
Keywords: Pediatric posterior fossa tumor | Magnetic resonance imaging | Biomarker | Machine learning | Feature selection | Classifier
Determining relevant biomarkers for prediction of breast cancer using anthropometric and clinical features: A comparative investigation in machine learning paradigm
Determining biomarkers for breast cancer prediction using anthropometric and clinical features: A comparative investigation in the machine learning paradigm (2019)
Early detection of breast cancer plays a crucial role in the planning and outcome of the associated treatment. The purpose of this article is threefold: (i) to investigate whether clinical features obtained using routine blood analysis, combined with anthropometric measurements, can be used to predict breast cancer with machine learning techniques; (ii) to explore the role of various machine learning components, such as feature selection, data division protocols, and classification, in determining suitable biomarkers for breast cancer prediction; and (iii) to evaluate a recent database of clinical and anthropometric measurements acquired from normal individuals and individuals suffering from breast cancer. A database consisting of anthropometric and clinical attributes is used in the experiments. Various feature selection and statistical significance analysis methods are used to determine the relevance of the features. Furthermore, popular classifiers such as kernel-based support vector machine (SVM), Naïve Bayesian, linear discriminant, quadratic discriminant, logistic regression, K-nearest neighbor (K-NN), and random forest were implemented and evaluated for breast cancer risk prediction using these features. The results of the feature selection techniques indicate that, among the nine features considered in this study, glucose, age, and resistin are the most relevant and effective biomarkers for breast cancer prediction. Further, when these three features are used for classification, the medium K-NN classifier achieves the highest classification accuracy of 92.105%, followed by the medium Gaussian SVM, which achieves a classification accuracy of 83.684% under the hold-out data division protocol.
Keywords: Breast cancer biomarkers | Machine learning | Expert systems | Clinical features | Feature selection
A recommender system for component-based applications using machine learning techniques
A recommender system for component-based applications using machine learning techniques (2019)
Software designers are striving to create software that adapts to their users’ requirements. To this end, the development of component-based interfaces that users can compose and customize according to their needs is increasing. However, the success of these applications is highly dependent on the users’ ability to locate the components useful to them, because there are often too many to choose from. We propose an approach to address the problem of suggesting the most suitable components for each user at each moment, by creating a recommender system using intelligent data analysis methods. Once we have gathered the interaction data and built a dataset, we apply feature engineering techniques and feature selection methods to transform the original dataset from a real component-based application into an optimized dataset suitable for machine learning algorithms. Moreover, many aspects, such as contextual information, the use of the application across several devices with many forms of interaction, or the passage of time (components are added or removed over time), are taken into consideration. Once the dataset is optimized, several machine learning algorithms are applied to create recommendation systems. A series of experiments is conducted in which recommendation models are built by applying several machine learning algorithms to the optimized dataset (before and after applying feature selection methods) to determine which recommender model obtains the highest accuracy. Thus, by deploying the recommendation system with the best results, the likelihood of success of a component-based application is increased, because users can find the most suitable components for them, which enhances their user experience and their engagement with the application.
Keywords: Machine learning | Recommender systems | Feature engineering | Feature selection | Component-based interfaces | Interaction information acquisition
Comparison of skin disease prediction by feature selection using ensemble data mining techniques
Comparison of skin disease prediction by feature selection using ensemble data mining techniques (2019)
Background: Skin disease is a major problem among people worldwide. Different machine learning techniques can be applied to identify classes of skin disease. Herein, we have applied machine learning algorithms to categorize classes of skin disease using ensemble techniques, and then a feature selection method is utilized to compare the results obtained. Method: In the proposed study, we present a new method which applies six different data mining classification techniques and then develops an ensemble approach using Bagging, AdaBoost, and Gradient Boosting classifier techniques to predict classes of skin disease. Furthermore, a feature importance method is utilized to select the 15 most salient features, which play a major role in prediction. A subset of the original dataset is obtained after selecting the 15 features, to compare the results of the six machine learning techniques, and an ensemble approach is applied to the entire dataset. Results: The ensemble method is compared with the subset obtained from the feature selection method. The outcome shows that the dermatological prediction accuracy on the test dataset is increased compared with the use of an individual classifier, and improved accuracy is obtained compared with the feature selection subset method. Conclusion: The ensemble method and feature selection applied to dermatology datasets yield better performance than individual classifier algorithms. The ensemble method provides a more accurate and effective skin disease prediction.
Keywords: Skin disease | Dermatology | Extra tree classifier | Radius neighbors classifier | Passive aggressive classifier