Data Mining Strategies for Real-Time Control in New York City
Data mining strategy for real-time control in New York City-2105
The Data Mining System (DMS) at the New York City Department of Transportation (NYCDOT) mainly consists of four database systems for traffic and pedestrian/bicycle volumes, crash data, and signal timing plans, as well as the Midtown in Motion (MIM) systems, which are used as part of the NYCDOT Intelligent Transportation System (ITS) infrastructure. These database and control systems are operated by different units at NYCDOT, each as an independent database or operation system. New York City experiences heavy volumes of traffic, pedestrians, and cyclists in each Central Business District (CBD) area and along key arterial systems. There are consistent and urgent needs in New York City for real-time control to improve mobility and safety for all users of the street networks, and to provide timely response to and management of random incidents. It is therefore necessary to develop an integrated DMS for effective real-time control and active transportation management (ATM) in New York City. This paper presents new strategies for New York City for developing an efficient and cost-effective DMS, involving: 1) use of new technology applications such as tablets and smartphones with Global Positioning System (GPS) and wireless communication features for data collection and reduction; 2) interface development among existing database and control systems; and 3) integrated DMS deployment with macroscopic and mesoscopic simulation models in Manhattan. The paper also suggests a complete data mining process for real-time control that combines traditional static data, current real-time data from loop detectors, microwave sensors, and video cameras, and new real-time data from GPS. GPS data, including taxi and bus GPS information and data from smartphone applications, can be obtained in all weather conditions and at any time of day. The use of GPS data and smartphone applications in the NYCDOT DMS is discussed herein as a new concept. © 2014 The Authors. Published by Elsevier B.V.
Selection and peer-review under responsibility of Elhadi M. Shakshu
Keywords: Data Mining System (DMS) | New York City | real-time control | active transportation management (ATM) | GPS data
Physical metallurgy-guided machine learning and artificial intelligent design of ultrahigh-strength stainless steel
Machine learning guided by physical metallurgy and artificial intelligence design of strong stainless steel-2019
With the development of the materials genome philosophy and data mining methodologies, machine learning (ML) has been widely applied to discover new materials in various systems, including high-end steels with improved performance. Although some recent attempts have been made to incorporate physical features in the ML process, their effects have not been demonstrated, systematically analysed, or experimentally validated with prototype alloys. To address this issue, a physical metallurgy (PM)-guided ML model was developed, wherein intermediate parameters, e.g., the equilibrium volume fraction (Vf) and driving force (Df) for precipitation, were generated from the original inputs and PM principles and added to the original dataset vectors as extra dimensions to participate in and guide the ML process. As a result, the ML process becomes more robust when dealing with small datasets, because the data quality is improved and the data information is enriched. On this basis, a new material design method is proposed that combines PM-guided ML regression, an ML classifier, and a genetic algorithm (GA). The model was successfully applied to the design of advanced ultrahigh-strength stainless steels using only a small database extracted from the literature. The proposed prototype alloy, with a leaner chemistry but better mechanical properties, has been produced experimentally, and excellent agreement was obtained between the predicted optimal parameter settings and the final properties. In addition, the present work clearly demonstrates that implementing PM parameters can improve the design accuracy and efficiency by eliminating intermediate solutions that do not obey PM principles in the ML process. Furthermore, various important factors influencing the generalizability of the ML model are discussed in detail.
Keywords: Alloy design | Machine learning | Physical metallurgy | Small sample problem | Stainless steel
Combining hierarchical clustering approaches using the PCA method
Combining hierarchical clustering methods using the PCA method-2019
In expert systems, data mining methods are algorithms that simulate humans' problem-solving capabilities. Clustering methods, as unsupervised machine learning methods, are crucial approaches for grouping similar samples into the same categories. Applying different clustering algorithms to a given dataset produces clusters of different quality. Hence, many researchers have applied clustering combination methods to reduce the risk of choosing an inappropriate clustering algorithm. In these methods, the outputs of several clustering algorithms are combined: the input hierarchical clusterings are transformed to descriptor matrices, and their combination is achieved by aggregating these descriptor matrices. In previous works, only element-wise aggregation operators have been used, and the relation between the elements of each descriptor matrix has been ignored. However, the value of each element of a descriptor matrix is meaningful only in comparison with its other elements. The current study proposes a novel method of combining hierarchical clustering approaches based on principal component analysis (PCA). PCA as an aggregator allows all elements of the descriptor matrices to be considered. In the proposed approach, basic clusterings are made and transformed to descriptor matrices. Then, a final matrix is extracted from the descriptor matrices using PCA. Next, a final dendrogram, which summarizes the results of the diverse clusterings, is constructed from this matrix. The experimental results on popular available datasets show that the clustering accuracy of the proposed method is superior to that of basic clustering methods such as single, average, and centroid linkage and of previous combined hierarchical clustering methods. In addition, statistical tests show that the proposed method significantly outperformed hierarchical clustering combination methods with element-wise averaging operators in almost all tested datasets. Several experiments have also been conducted which confirm the robustness of the proposed method to its parameter setting.
Keywords: Clustering | Hierarchical clustering | Principal component analysis | PCA
A Cryptographic Ensemble for secure third party data analysis: Collaborative data clustering without data owner participation
A cryptographic ensemble for secure third-party data analysis: collaborative data clustering without data owner participation-2019
This paper introduces the twin concepts Cryptographic Ensembles and Global Encrypted Distance Matrices (GEDMs), designed to provide a solution to outsourced secure collaborative data clustering. The cryptographic ensemble comprises: Homomorphic Encryption (HE) to preserve raw data privacy, while supporting data analytics; and Multi-User Order Preserving Encryption (MUOPE) to preserve the privacy of the GEDM. Clustering can therefore be conducted over encrypted datasets without requiring decryption or the involvement of data owners once encryption has taken place, all with no loss of accuracy. The GEDM concept is applicable to large scale collaborative data mining applications that feature horizontal data partitioning. In the paper DBSCAN clustering is adopted for illustrative and evaluation purposes. The results demonstrate that the proposed solution is both efficient and accurate while maintaining data privacy.
Keywords: Data mining as a service | Privacy preserving data mining | Security | Data outsourcing
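One way to see why clustering can run on an order-preserved distance matrix without loss of accuracy: DBSCAN only ever asks whether a distance is at most a threshold, and any strictly increasing transform keeps the answer to that question unchanged. The sketch below illustrates this with a toy monotone function standing in for MUOPE. The names `toy_ope`, `distance_matrix`, and `neighbourhoods` are hypothetical, the transform is not secure, and the homomorphic-encryption step that would compute the distances is omitted entirely.

```python
from itertools import combinations

def toy_ope(value, key=7.0):
    """Toy stand-in for order-preserving encryption: strictly increasing
    for value >= 0. NOT secure; for illustration only."""
    return key * value ** 3 + key

def distance_matrix(points):
    """Plain squared Euclidean distances between all pairs of points."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        dist[i][j] = dist[j][i] = d
    return dist

def neighbourhoods(dist, eps):
    """Epsilon-neighbourhood of every point: the only distance query DBSCAN needs."""
    return [{j for j, d in enumerate(row) if d <= eps} for row in dist]

# Hypothetical records pooled from several data owners.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
plain = distance_matrix(points)
encrypted = [[toy_ope(d) for d in row] for row in plain]

eps = 0.1  # threshold on the squared distance
plain_nb = neighbourhoods(plain, eps)
encrypted_nb = neighbourhoods(encrypted, toy_ope(eps))  # threshold transformed the same way
print(plain_nb == encrypted_nb)
```

Because the neighbourhoods agree, core points, border points, and noise are assigned identically in both cases, which is the sense in which the third party never needs the plain distances.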
Machine Learning Techniques for Satellite Fault Diagnosis
Machine learning techniques for satellite fault diagnosis-2019
Satellites are remotely operated systems with a high degree of complexity due to the large number of interconnected devices onboard. Consequently, a satellite has a correspondingly large number of telemetry parameters that allow operators and designers to fully control and monitor its mode of operation. The tremendous amount of telemetry data received from the satellite during its lifetime has to be analyzed in order to monitor and control subsystem health for better decision making and fast response. In this research, we address the use of machine learning techniques to diagnose faults of satellite subsystems using their telemetry parameters. The case study and telemetry source are the Egyptsat-1 satellite, which was launched in April 2007 and lost communication with the ground station in 2010. We applied machine learning techniques to identify operating modes and the corresponding telemetry parameters. We used Support Vector Machine for Regression to analyze the satellite performance; a fault diagnosis approach was then applied to determine the most probable reason for the satellite failure. Telemetry data is clustered using the k-means clustering algorithm in combination with t-distributed stochastic neighbor embedding (t-SNE) for dimensionality reduction. We classified data using Logical Analysis of Data (LAD) to generate positive patterns for each failure class, which are used to determine the failure-cause probability for each telemetry parameter. These probabilities enable Fault Tree Analysis (FTA) to identify the most probable cause that led to the satellite failure.
Keywords: Machine learning | Telemetry data mining | Satellite fault diagnosis | Logical analysis of data | Fault tree analysis
Determination of the Blood, Hormone and Obesity Value Ranges that Indicate the Breast Cancer, Using Data Mining Based Expert System
Determining the ranges of blood, hormone and obesity values that indicate breast cancer, using a data mining based expert system-2019
Breast cancer is a dangerous type of cancer that spreads to other organs over time. Medical studies therefore pursue early diagnosis by means of anthropometric data and blood analysis values in addition to mammographic and histological findings. However, medical studies have identified only which values are cancer-related; the value ranges indicating cancer have not yet been determined. Concurrently, automated diagnostic systems are being developed in biomedical engineering studies to assist medical specialists. In these biomedical methods, the value ranges or boundaries indicating cancer are determined automatically, but only the diagnostic result is presented. As a result, biomedical studies do not give medical experts enough opportunity to evaluate the relationship between the values and the result. In this study, decision trees, a data mining method, were applied to anthropometric data and blood analysis values to address these deficiencies in breast cancer diagnosis studies. The determined value ranges were also presented visually so that medical experts can understand them easily. The proposed diagnostic system has an accuracy rate of up to 90.52%, provides value ranges indicating breast cancer, and presents the relations between the values and cancer mathematically.
Keywords: Breast cancer | Data mining | Obesity | Hormone | Blood analysis
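A decision tree yields value ranges because every internal node is a threshold test on one measurement. As a minimal sketch of that idea, the code below finds the single best threshold for one feature by minimising the weighted Gini impurity. The function names and the measurements are hypothetical illustrations, not values or code from the study, and a full tree would repeat this search recursively over all features.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate equal values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2.0
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical blood values (arbitrary units); 1 = patient, 0 = control.
blood_value = [78, 85, 88, 92, 101, 110, 118, 125]
label = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_threshold(blood_value, label)
print(f"range indicating the positive class: value > {threshold} (impurity {impurity:.2f})")
```

Reading the thresholds along a root-to-leaf path gives the kind of explicit range that can be shown to a medical expert, rather than only a yes/no result.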
Data analysis of multi-dimensional thermophysical properties of liquid substances based on clustering approach of machine learning
Data analysis of the multi-dimensional thermophysical properties of liquid substances based on a machine learning clustering approach-2019
In order to develop an efficient framework for global screening in material exploration, we performed a machine learning clustering analysis of the multi-dimensional thermophysical properties of liquid substances. Data mining using a self-organizing map (SOM), an unsupervised learning method, was employed to project high-dimensional thermophysical data onto a low-dimensional space. Here we adopted 98 liquid substances with eight thermophysical properties for the SOM training in order to group the liquid substances. The present SOM-clustering approach properly categorized the liquid substances according to the chemical species characterized by their functional groups.
Keywords: Self-organizing map | Clustering analysis | Machine learning | Thermophysical properties | Heat medium
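The SOM projection described above can be sketched in a few lines. The example below trains a tiny one-dimensional map and assigns each sample to its best-matching unit. Everything in it is an assumption for illustration: the substance names and normalized property values are made up, the map has 4 nodes on a line rather than whatever grid the study used, and the decay schedules are arbitrary choices.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to sample x."""
    return min(range(len(weights)),
               key=lambda n: sum((a - b) ** 2 for a, b in zip(weights[n], x)))

def train_som(data, n_nodes=4, epochs=50, seed=0):
    """Train a 1-D self-organizing map; returns one weight vector per node."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac)  # learning rate decays linearly towards 0
        radius = max(0.5, (n_nodes / 2.0) * (1.0 - frac))  # shrinking neighbourhood
        for x in data:
            bmu = best_matching_unit(weights, x)
            for node, w in enumerate(weights):
                # Gaussian neighbourhood on the 1-D node grid
                h = math.exp(-((node - bmu) ** 2) / (2.0 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights

# Hypothetical liquids with three properties, each already scaled to [0, 1].
substances = {
    "liquid_1": [0.10, 0.20, 0.15],
    "liquid_2": [0.12, 0.25, 0.10],
    "liquid_3": [0.90, 0.80, 0.85],
    "liquid_4": [0.88, 0.75, 0.95],
    "liquid_5": [0.50, 0.10, 0.90],
    "liquid_6": [0.55, 0.15, 0.85],
}
weights = train_som(list(substances.values()))
bmus = {name: best_matching_unit(weights, x) for name, x in substances.items()}
print(bmus)
```

Samples that share a best-matching unit, or land on adjacent units, form the groups that would then be compared against chemical species.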
Machine learning-based coronary artery disease diagnosis: A comprehensive review
Machine learning-based coronary artery disease diagnosis: a comprehensive review-2019
Coronary artery disease (CAD) is the most common cardiovascular disease (CVD) and often leads to a heart attack. It annually causes millions of deaths and billions of dollars in financial losses worldwide. Angiography, which is invasive and risky, is the standard procedure for diagnosing CAD. Alternatively, machine learning (ML) techniques have been widely used in the literature as fast, affordable, and noninvasive approaches for CAD detection. The results that have been published on ML-based CAD diagnosis differ substantially in terms of the analyzed datasets, sample sizes, features, location of data collection, performance metrics, and applied ML techniques. Due to these fundamental differences, achievements in the literature cannot be generalized. This paper conducts a comprehensive and multifaceted review of all relevant studies that were published between 1992 and 2019 for ML-based CAD diagnosis. The impacts of various factors, such as dataset characteristics (geographical location, sample size, features, and the stenosis of each coronary artery) and applied ML techniques (feature selection, performance metrics, and method) are investigated in detail. Finally, the important challenges and shortcomings of ML-based CAD diagnosis are discussed.
Keywords: CAD diagnosis | Machine learning | Data mining | Feature selection
An efficient manifold regularized sparse non-negative matrix factorization model for large-scale recommender systems on GPUs
An efficient manifold regularized sparse non-negative matrix factorization model for large-scale recommender systems on GPUs-2019
Article history: Received 31 January 2018 | Revised 1 July 2018 | Accepted 25 July 2018 | Available online 27 July 2018
Non-negative Matrix Factorization (NMF) plays an important role in many data mining applications for low-rank representation and analysis. Due to the sparsity that is caused by missing information in many high-dimension scenes, e.g., social networks or recommender systems, NMF cannot mine a more accurate representation from the explicit information. Manifold learning can incorporate the intrinsic geometry of the data, which is combined with a neighborhood with implicit information. Thus, manifold-regularized NMF (MNMF) can realize a more compact representation for the sparse data. However, MNMF suffers from (a) the forming of large-scale Laplacian matrices, (b) frequent large-scale matrix manipulation, and (c) the involved K-nearest neighbor points, which will result in the overwriting problem in parallelization. To address these issues, a single-thread-based MNMF model is proposed on two types of divergence, i.e., Euclidean distance and Kullback–Leibler (KL) divergence, which depends only on the involved feature-tuples' multiplication and summation and can avoid large-scale matrix manipulation. Furthermore, this model can remove the dependence among the feature vectors with fine-grain parallelization inherence. On that basis, a CUDA parallelization MNMF (CUMNMF) is presented on GPU computing. From the experimental results, CUMNMF achieves a 20X speedup compared with MNMF, as well as a lower time complexity and space requirement. © 2018 Published by Elsevier Inc.
Keywords: Collaborative filtering recommender systems | Data mining | Euclidean distance and KL-divergence | GPU parallelization | Manifold regularization | Non-negative matrix factorization
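To make the idea of updates that need only per-entry multiplication and summation concrete, here is a minimal sketch of plain NMF under the Euclidean distance, restricted to the observed ratings. It uses the standard multiplicative update rule, not the paper's single-thread-based rule, and it leaves out both the manifold regularization term and the CUDA parallelization. All function names and the rating data are hypothetical.

```python
import random

def predict(W, H, u, i):
    """Predicted rating: dot product of user and item feature vectors."""
    return sum(wu * hi for wu, hi in zip(W[u], H[i]))

def squared_error(ratings, W, H):
    return sum((r - predict(W, H, u, i)) ** 2 for (u, i), r in ratings.items())

def multiplicative_step(A, B, ratings, a_is_user, eps=1e-12):
    """Update factor A in place, holding B fixed, using observed entries only."""
    rank = len(A[0])
    num = [[0.0] * rank for _ in A]
    den = [[0.0] * rank for _ in A]
    for (u, i), r in ratings.items():
        a, b = (u, i) if a_is_user else (i, u)
        pred = sum(x * y for x, y in zip(A[a], B[b]))
        for k in range(rank):
            num[a][k] += r * B[b][k]     # accumulates only multiply-and-sum terms
            den[a][k] += pred * B[b][k]
    for a in range(len(A)):
        for k in range(rank):
            if den[a][k] > 0.0:          # rows with no observations stay unchanged
                A[a][k] *= num[a][k] / (den[a][k] + eps)

def masked_nmf(ratings, n_users, n_items, rank=2, iters=50, seed=0):
    rng = random.Random(seed)
    W = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_items)]
    errors = [squared_error(ratings, W, H)]
    for _ in range(iters):
        multiplicative_step(W, H, ratings, a_is_user=True)
        multiplicative_step(H, W, ratings, a_is_user=False)
        errors.append(squared_error(ratings, W, H))
    return W, H, errors

# Hypothetical sparse ratings: (user, item) -> rating; missing pairs are unobserved.
ratings = {(0, 0): 5.0, (0, 1): 4.0, (1, 0): 4.0, (1, 2): 1.0,
           (2, 1): 1.0, (2, 2): 5.0, (3, 0): 1.0, (3, 2): 4.0}
W, H, errors = masked_nmf(ratings, n_users=4, n_items=3)
print(f"squared error: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

Each row of a factor is updated from its own accumulated sums, so rows can in principle be processed independently, which is the property a GPU implementation would exploit.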
Machine learning and data mining frameworks for predicting drug response in cancer: An overview and a novel in silico screening process based on association rule mining
Machine learning and data mining frameworks for predicting drug response in cancer: an overview and a novel in silico screening process based on association rule mining-2019
A major challenge in cancer treatment is predicting the clinical response to anti-cancer drugs on a personalized basis. The success of such a task largely depends on the ability to develop computational resources that integrate big "omic" data into effective drug-response models. Machine learning is both an expanding and an evolving computational field that holds promise to cover such needs. Here we provide a focused overview of: 1) the various supervised and unsupervised algorithms used specifically in drug response prediction applications, 2) the strategies employed to develop these algorithms into applicable models, 3) data resources that are fed into these frameworks and 4) pitfalls and challenges to maximize model performance. In this context we also describe a novel in silico screening process, based on Association Rule Mining, for identifying genes as candidate drivers of drug response and compare it with relevant data mining frameworks, for which we generated a web application freely available at: https://compbio.nyumc.org/drugs/. This pipeline explores large sample spaces with high efficiency, while being able to detect low-frequency events and evaluate statistical significance even in the multidimensional space, presenting the results in the form of easily interpretable rules. We conclude with future prospects and challenges of applying machine learning based drug response prediction in precision medicine.
Keywords: Drug Response Prediction | Precision Medicine | Data mining | Machine Learning | Association Rule Mining
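The kind of interpretable rule such a screening process produces can be illustrated with a brute-force association rule miner. The sketch below enumerates small gene sets and keeps rules of the form "gene set -> sensitive" that meet support and confidence thresholds. The gene names, cell lines, and thresholds are hypothetical, the function `mine_rules` is not the published pipeline, and no statistical significance test is included.

```python
from itertools import combinations

def mine_rules(transactions, target, min_support=0.3, min_confidence=0.8, max_len=2):
    """Return rules 'itemset -> target' as (itemset, support, confidence)."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t if item != target})
    rules = []
    for size in range(1, max_len + 1):
        for itemset in combinations(items, size):
            covered = [t for t in transactions if set(itemset) <= t]
            both = [t for t in covered if target in t]
            support = len(both) / n  # fraction of all samples matching the full rule
            if covered and support >= min_support:
                confidence = len(both) / len(covered)
                if confidence >= min_confidence:
                    rules.append((itemset, support, confidence))
    return rules

# Hypothetical cell lines: mutated genes plus an observed drug-response label.
cell_lines = [
    {"GENE_A", "GENE_B", "sensitive"},
    {"GENE_A", "GENE_B", "sensitive"},
    {"GENE_A", "sensitive"},
    {"GENE_B"},
    {"GENE_C"},
    {"GENE_A", "GENE_C", "sensitive"},
]
rules = mine_rules(cell_lines, target="sensitive")
for itemset, support, confidence in rules:
    print(f"{' & '.join(itemset)} -> sensitive "
          f"(support {support:.2f}, confidence {confidence:.2f})")
```

Lowering `min_support` is what lets low-frequency events surface, at the cost of more candidate rules to check.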