Data Mining Strategies for Real-Time Control in New York City
Data mining strategy for real-time control in New York City-2105
The Data Mining System (DMS) at the New York City Department of Transportation (NYCDOT) mainly consists of four database systems for traffic and pedestrian/bicycle volumes, crash data, and signal timing plans, as well as the Midtown in Motion (MIM) systems, which are used as part of the NYCDOT Intelligent Transportation System (ITS) infrastructure. These database and control systems are operated by different units at NYCDOT as independent database or operation systems. New York City experiences heavy volumes of traffic, pedestrians, and cyclists in each Central Business District (CBD) area and along key arterial systems. There is a consistent and urgent need in New York City for real-time control to improve mobility and safety for all users of the street networks, and to provide timely response to and management of random incidents. Therefore, it is necessary to develop an integrated DMS for effective real-time control and active transportation management (ATM) in New York City. This paper presents new strategies for New York City for developing an efficient and cost-effective DMS, involving: 1) use of new technology applications such as tablets and smartphones with Global Positioning System (GPS) and wireless communication features for data collection and reduction; 2) interface development among existing database and control systems; and 3) integrated DMS deployment with macroscopic and mesoscopic simulation models in Manhattan. This paper also suggests a complete data mining process for real-time control with traditional static data, current real-time data from loop detectors, microwave sensors, and video cameras, and new real-time data from GPS. GPS data, including taxi and bus GPS information and data from smartphone applications, can be obtained in all weather conditions and at any time of day. The use of GPS data and smartphone applications in the NYCDOT DMS is discussed herein as a new concept. © 2014 The Authors. Published by Elsevier B.V.
Selection and peer-review under responsibility of Elhadi M. Shakshu
Keywords: Data Mining System (DMS) | New York City | real-time control | active transportation management (ATM) | GPS data
A Cryptographic Ensemble for secure third party data analysis: Collaborative data clustering without data owner participation
A cryptographic ensemble for secure third-party data analysis: collaborative data clustering without data owner participation-2019
This paper introduces the twin concepts of Cryptographic Ensembles and Global Encrypted Distance Matrices (GEDMs), designed to provide a solution to outsourced secure collaborative data clustering. The cryptographic ensemble comprises Homomorphic Encryption (HE), to preserve raw data privacy while supporting data analytics, and Multi-User Order Preserving Encryption (MUOPE), to preserve the privacy of the GEDM. Clustering can therefore be conducted over encrypted datasets without requiring decryption or the involvement of data owners once encryption has taken place, all with no loss of accuracy. The GEDM concept is applicable to large-scale collaborative data mining applications that feature horizontal data partitioning. In the paper, DBSCAN clustering is adopted for illustrative and evaluation purposes. The results demonstrate that the proposed solution is both efficient and accurate while maintaining data privacy.
Keywords: Data mining as a service | Privacy preserving data mining | Security | Data outsourcing
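The claim above that clustering can run on an order-preserved distance matrix with no loss of accuracy can be illustrated with a small sketch. DBSCAN only ever asks whether a distance is at most a threshold, so any strictly increasing map applied to both the distances and the threshold leaves the result unchanged. The sketch below is not the paper's MUOPE scheme: the function `mask` is a hypothetical stand-in for an order-preserving encryption, and the readings are made-up integers.

```python
NOISE = -1

def dbscan_precomputed(d, eps, min_pts):
    """Minimal DBSCAN over a precomputed distance matrix d (list of lists)."""
    n = len(d)
    labels = [None] * n                      # None means "not visited yet"
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbours = [q for q in range(n) if d[p][q] <= eps]
        if len(neighbours) < min_pts:        # neighbourhood includes p itself
            labels[p] = NOISE
            continue
        cluster += 1
        labels[p] = cluster
        stack = [q for q in neighbours if q != p]
        while stack:
            q = stack.pop()
            if labels[q] == NOISE:           # noise reachable from a core point
                labels[q] = cluster          # becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbours = [r for r in range(n) if d[q][r] <= eps]
            if len(q_neighbours) >= min_pts: # q is a core point: keep expanding
                stack.extend(q_neighbours)
    return labels

def mask(x):
    """Hypothetical stand-in for order-preserving encryption: strictly increasing."""
    return x ** 3 + 5 * x + 7

# Made-up one-dimensional integer readings, so all arithmetic is exact.
readings = [1, 2, 3, 10, 11, 12, 30]
plain = [[abs(a - b) for b in readings] for a in readings]
masked = [[mask(v) for v in row] for row in plain]

plain_labels = dbscan_precomputed(plain, eps=1, min_pts=2)
masked_labels = dbscan_precomputed(masked, eps=mask(1), min_pts=2)
print(plain_labels, masked_labels)
```

Both calls produce the same labelling because every comparison `d[p][q] <= eps` has the same truth value before and after masking. In a real deployment the threshold would also have to be supplied in masked form, which is an assumption of this sketch.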
Determination of the Blood, Hormone and Obesity Value Ranges that Indicate the Breast Cancer, Using Data Mining Based Expert System
Determining the blood, hormone and obesity value ranges that indicate breast cancer, using a data mining based expert system-2019
Breast cancer is a dangerous type of cancer that spreads into other organs over time. Medical studies therefore aim at early diagnosis using anthropometric data and blood analysis values in addition to mammographic and histological findings. However, medical studies have identified only which values are related to cancer; the value ranges that indicate cancer have not yet been determined. At the same time, automated diagnostic systems are being developed in biomedical engineering studies to assist medical specialists. In these biomedical methods, the value ranges or boundaries indicating cancer are determined automatically, but only the diagnostic result is presented. As a result, biomedical studies do not give medical experts enough opportunity to evaluate the relationship between the values and the result. In this study, decision trees, a data mining method, were applied to anthropometric data and blood analysis values to address these deficiencies in studies aimed at breast cancer diagnosis. The determined value ranges were also presented visually so that medical experts can understand them easily. The proposed diagnostic system has an accuracy of up to 90.52%, provides the value ranges that indicate breast cancer, and presents the relations between the values and cancer mathematically.
Keywords: Breast cancer | Data mining | Obesity | Hormone | Blood analysis
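How a decision tree turns measurements into an explicit value range can be sketched with a single split. The example below is an illustration only, not the study's model: it searches one feature for the threshold that minimises weighted Gini impurity, and the feature values and labels are made-up, unitless toy numbers with no clinical meaning.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, impurity) of the best single split 'value <= threshold'."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best_t, best_score = None, float("inf")
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Synthetic toy data (hypothetical marker, 1 = positive class).
marker = [70, 82, 88, 95, 110, 120, 135, 150]
label = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(marker, label)
print(f"rule: marker > {threshold} -> positive (impurity {impurity})")
```

A full tree repeats this search recursively on each side of the split, which is what yields nested ranges over several features that an expert can read directly.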
Machine learning-based coronary artery disease diagnosis: A comprehensive review
Machine learning-based coronary artery disease diagnosis: a comprehensive review-2019
Coronary artery disease (CAD) is the most common cardiovascular disease (CVD) and often leads to a heart attack. It annually causes millions of deaths and billions of dollars in financial losses worldwide. Angiography, which is invasive and risky, is the standard procedure for diagnosing CAD. Alternatively, machine learning (ML) techniques have been widely used in the literature as fast, affordable, and noninvasive approaches for CAD detection. The results that have been published on ML-based CAD diagnosis differ substantially in terms of the analyzed datasets, sample sizes, features, location of data collection, performance metrics, and applied ML techniques. Due to these fundamental differences, achievements in the literature cannot be generalized. This paper conducts a comprehensive and multifaceted review of all relevant studies that were published between 1992 and 2019 for ML-based CAD diagnosis. The impacts of various factors, such as dataset characteristics (geographical location, sample size, features, and the stenosis of each coronary artery) and applied ML techniques (feature selection, performance metrics, and method) are investigated in detail. Finally, the important challenges and shortcomings of ML-based CAD diagnosis are discussed.
Keywords: CAD diagnosis | Machine learning | Data mining | Feature selection
An efficient manifold regularized sparse non-negative matrix factorization model for large-scale recommender systems on GPUs
An efficient manifold regularized sparse non-negative matrix factorization model for large-scale recommender systems on GPUs-2019
Article history: Received 31 January 2018; Revised 1 July 2018; Accepted 25 July 2018; Available online 27 July 2018
Non-negative Matrix Factorization (NMF) plays an important role in many data mining applications for low-rank representation and analysis. Due to the sparsity that is caused by missing information in many high-dimension scenes, e.g., social networks or recommender systems, NMF cannot mine a more accurate representation from the explicit information. Manifold learning can incorporate the intrinsic geometry of the data, which is combined with a neighborhood with implicit information. Thus, manifold-regularized NMF (MNMF) can realize a more compact representation for the sparse data. However, MNMF suffers from (a) the forming of large-scale Laplacian matrices, (b) frequent large-scale matrix manipulation, and (c) the involved K-nearest neighbor points, which will result in the overwriting problem in parallelization. To address these issues, a single-thread-based MNMF model is proposed on two types of divergence, i.e., Euclidean distance and Kullback–Leibler (KL) divergence, which depends only on the involved feature-tuples' multiplication and summation and can avoid large-scale matrix manipulation. Furthermore, this model can remove the dependence among the feature vectors with fine-grain parallelization inherence. On that basis, a CUDA parallelization MNMF (CUMNMF) is presented on GPU computing. From the experimental results, CUMNMF achieves a 20X speedup compared with MNMF, as well as a lower time complexity and space requirement. © 2018 Published by Elsevier Inc.
Keywords: Collaborative filtering recommender systems | Data mining | Euclidean distance and KL-divergence | GPU parallelization | Manifold regularization | Non-negative matrix factorization
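The idea of updating factors using only multiplications and sums over the observed entries, without forming large dense matrices, can be sketched as follows. This is plain masked NMF with multiplicative updates for the Euclidean loss; it omits the manifold regularization term and the CUDA parallelization, so it is not the CUMNMF model itself. The function name `masked_nmf` and the toy ratings are assumptions made for illustration.

```python
import random

def masked_nmf(ratings, n_users, n_items, rank=2, iters=50, seed=0):
    """Factorise observed ratings {(user, item): value} as U[user] . V[item]."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_items)]

    # Index the observed entries by row and by column once.
    by_user, by_item = {}, {}
    for (u, i), r in ratings.items():
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def loss():
        return sum((r - dot(U[u], V[i])) ** 2 for (u, i), r in ratings.items())

    def update(rows, others, observed):
        tiny = 1e-12                         # guards against division by zero
        for idx, obs in observed.items():
            # Predictions use the row's old values for every component k.
            preds = [(j, r, dot(rows[idx], others[j])) for j, r in obs]
            rows[idx] = [
                rows[idx][k]
                * sum(r * others[j][k] for j, r, _ in preds)
                / (sum(p * others[j][k] for j, _, p in preds) + tiny)
                for k in range(rank)
            ]

    history = [loss()]
    for _ in range(iters):
        update(U, V, by_user)                # each user row is independent
        update(V, U, by_item)                # each item row is independent
        history.append(loss())
    return U, V, history

# Made-up toy ratings: (user, item) -> rating.
toy = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1,
       (2, 1): 1, (2, 2): 5, (3, 0): 5, (3, 2): 1}
U, V, history = masked_nmf(toy, n_users=4, n_items=3)
print(history[0], history[-1])
```

Because each row update reads only the other factor matrix, the rows within one half-step could be processed in parallel, which is the property a GPU implementation would exploit.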
Machine learning and data mining frameworks for predicting drug response in cancer: An overview and a novel in silico screening process based on association rule mining
Machine learning and data mining frameworks for predicting drug response in cancer: an overview and a novel in silico screening process based on association rule mining-2019
A major challenge in cancer treatment is predicting the clinical response to anti-cancer drugs on a personalized basis. The success of such a task largely depends on the ability to develop computational resources that integrate big “omic” data into effective drug-response models. Machine learning is both an expanding and an evolving computational field that holds promise to cover such needs. Here we provide a focused overview of: 1) the various supervised and unsupervised algorithms used specifically in drug response prediction applications, 2) the strategies employed to develop these algorithms into applicable models, 3) data resources that are fed into these frameworks and 4) pitfalls and challenges to maximize model performance. In this context we also describe a novel in silico screening process, based on Association Rule Mining, for identifying genes as candidate drivers of drug response and compare it with relevant data mining frameworks, for which we generated a web application freely available at: https://compbio.nyumc.org/drugs/. This pipeline explores large sample spaces with high efficiency, while being able to detect low-frequency events and evaluate statistical significance even in the multidimensional space, presenting the results in the form of easily interpretable rules. We conclude with future prospects and challenges of applying machine learning based drug response prediction in precision medicine.
Keywords: Drug Response Prediction | Precision Medicine | Data mining | Machine Learning | Association Rule Mining
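The "easily interpretable rules" mentioned above are usually scored with support, confidence and lift. The sketch below computes the three standard measures for one rule; it is not the paper's pipeline, and the feature names (`GENE_A_mut`, `GENE_B_mut`, `sensitive`) and the samples are hypothetical placeholders.

```python
def rule_metrics(samples, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    samples is a list of sets of observed features; antecedent and
    consequent are sets of features.
    """
    n = len(samples)
    n_ante = sum(1 for s in samples if antecedent <= s)
    n_cons = sum(1 for s in samples if consequent <= s)
    n_both = sum(1 for s in samples if (antecedent | consequent) <= s)
    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    lift = confidence / (n_cons / n) if n_cons else 0.0
    return support, confidence, lift

# Made-up samples: each set lists hypothetical features of one cell line.
samples = [
    {"GENE_A_mut", "sensitive"},
    {"GENE_A_mut", "sensitive"},
    {"GENE_A_mut", "GENE_B_mut", "sensitive"},
    {"GENE_B_mut"},
    {"GENE_B_mut", "sensitive"},
    {"GENE_A_mut"},
    set(),
    {"GENE_B_mut"},
]

support, confidence, lift = rule_metrics(samples, {"GENE_A_mut"}, {"sensitive"})
print(support, confidence, lift)
```

A lift above 1 means the consequent is more frequent among samples that satisfy the antecedent than in the whole set; assessing statistical significance, as the abstract describes, would need an additional test that this sketch does not include.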
Learning and predicting operation strategies by sequence mining and deep learning
Learning and predicting operation strategies by sequence mining and deep learning-2019
The operators of chemical technologies are frequently faced with the problem of determining optimal interventions. Our aim is to develop data-driven models by exploring the consequential relationships in the alarm and event-log database of industrial systems. Our motivation is twofold: (1) to facilitate the work of the operators by predicting future events and (2) to analyse how consequent the event series is. The core idea is that machine learning algorithms can learn sequences of events by exploring connected events in databases. First, frequent sequence mining applications are utilised to determine how the event sequences evolve during the operation. Second, a sequence-to-sequence deep learning model is proposed for their prediction. The long short-term memory (LSTM) unit-based model is capable of evaluating rare operation situations and their consequential events. The performance of this methodology is presented with regard to the analysis of the alarm and event-log database of an industrial delayed coker unit.
Keywords: Alarm management | Data mining | Data preprocessing | Deep learning | LSTM
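The first step described above, finding event sequences that recur in a log, can be sketched by counting contiguous subsequences. The second function is a simple frequency baseline for next-event prediction, used here in place of the sequence-to-sequence LSTM, which is well beyond a short example. The event names and the log are made up.

```python
from collections import Counter

def frequent_ngrams(log, n, min_count):
    """Contiguous event sequences of length n that occur at least min_count times."""
    counts = Counter(tuple(log[i:i + n]) for i in range(len(log) - n + 1))
    return {seq: c for seq, c in counts.items() if c >= min_count}

def most_likely_next(log, prefix):
    """Event that most often follows prefix in the log, or None if never seen."""
    k = len(prefix)
    target = tuple(prefix)
    followers = Counter(
        log[i + k] for i in range(len(log) - k) if tuple(log[i:i + k]) == target
    )
    return followers.most_common(1)[0][0] if followers else None

# Hypothetical alarm and event log, in time order.
log = [
    "HIGH_TEMP", "VALVE_OPEN", "ALARM_ACK",
    "HIGH_TEMP", "VALVE_OPEN", "ALARM_ACK",
    "LOW_FLOW",
    "HIGH_TEMP", "VALVE_OPEN", "PUMP_TRIP",
]

pairs = frequent_ngrams(log, n=2, min_count=2)
guess = most_likely_next(log, ["HIGH_TEMP", "VALVE_OPEN"])
print(pairs, guess)
```

The baseline always predicts the majority follower, so it would miss the rare `PUMP_TRIP` continuation in the toy log; handling such rare situations is the stated motivation for the learned model.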
CWV-BANN-SVM ensemble learning classifier for an accurate diagnosis of breast cancer
CWV-BANN-SVM ensemble learning classifier for accurate diagnosis of breast cancer-2019
This paper presents a new data mining technique for an accurate prediction of breast cancer (BC), which is one of the major mortality causes among women around the globe. The main objective of our study is to expand an automatic expert system (ES) to provide an accurate diagnosis of BC. Both Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs) were applied to analyze BC data. The well-known Wisconsin Breast Cancer Dataset (WBCD), available in the UCI repository, was examined in our study. We first tested the SVM algorithm using various values of the C, e and c parameters. In this first experiment, we observed that adjusting these regularization parameters can greatly improve the performance of the traditional SVM algorithm applied to BC detection. The highest accuracy obtained at the first step was 99.71%. Then, we developed a new BC detection approach based on two ensemble learning techniques: the confidence-weighted voting method and the boosting ensemble technique. Our model, called CWV-BANN-SVM, combines boosting ANNs (BANN) and two SVMs, using optimal parameters selected during the first experiment. The performance of the applied methods was evaluated using several popular metrics, such as specificity, sensitivity, precision, FPR, FNR, F1 score, AUC, Gini and accuracy. The proposed CWV-BANN-SVM model was able to improve the performance of the traditional machine learning algorithms applied to BC detection, reaching an accuracy of 100%. To overcome the overfitting issue, we determined and used appropriate parameter values for the polynomial SVM. Our comparison with the existing studies dedicated to BC prediction suggests that the proposed CWV-BANN-SVM model provides one of the best prediction performances overall.
Keywords: Data mining | Machine learning | Ensemble technique | Breast cancer | Support vector machine | Artificial neural network
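Confidence-weighted voting, one of the two ensemble techniques named above, can be sketched in a few lines. The abstract does not give the exact weighting used in CWV-BANN-SVM, so the sketch assumes the simplest variant: each classifier contributes its confidence to the label it predicts, and the label with the largest total wins. The member outputs below are made up.

```python
def confidence_weighted_vote(predictions):
    """Combine (label, confidence) pairs, one per classifier.

    Returns the winning label and the summed confidence per label.
    """
    totals = {}
    for label, confidence in predictions:
        totals[label] = totals.get(label, 0.0) + confidence
    winner = max(totals, key=totals.get)
    return winner, totals

# Hypothetical outputs of three ensemble members for one sample.
member_outputs = [
    ("malignant", 0.9),   # e.g. a boosted ANN that is very sure
    ("benign", 0.5),      # e.g. a first SVM, barely decided
    ("benign", 0.25),     # e.g. a second SVM, weakly decided
]

winner, totals = confidence_weighted_vote(member_outputs)
print(winner, totals)
```

A plain majority vote would label this sample benign (two votes to one), whereas weighting by confidence lets the single confident member prevail, which is the design choice this kind of voting makes.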
Recommender system based on pairwise association rules
Recommender system based on pairwise association rules-2019
Recommender systems based on methods such as collaborative and content-based filtering rely on extensive user profiles and item descriptors as well as on an extensive history of user preferences. Such methods face a number of challenges, including the cold-start problem in systems characterized by irregular usage, privacy concerns, and contexts where the range of indicators representing user interests is limited. We describe a recommender algorithm that builds a model of collective preferences independently of personal user interests and does not require a complex system of ratings. The performance of the algorithm is analyzed on a large transactional data set generated by a real-world dietary intake recall system.
Keywords: Association rules | Cold-start problem | Data mining | Ontologies | Recommender systems
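A model of collective preferences that needs no user profile can be sketched from co-occurrence alone. The abstract does not spell out the algorithm, so the following is an assumed, minimal reading of "pairwise association rules": estimate the confidence of every rule a -> b from past transactions, then score candidate items by summing the confidences of rules whose left side is already in the current selection. The meal data is made up.

```python
from collections import Counter
from itertools import permutations

def build_pair_confidence(transactions):
    """Confidence of each ordered pair rule a -> b: count(a and b) / count(a)."""
    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        items = set(t)
        item_count.update(items)
        pair_count.update(permutations(items, 2))   # ordered pairs (a, b), a != b
    return {(a, b): c / item_count[a] for (a, b), c in pair_count.items()}

def recommend(pair_conf, selection, top_n=3):
    """Rank items not yet selected by the summed confidence of matching rules."""
    scores = Counter()
    for (a, b), conf in pair_conf.items():
        if a in selection and b not in selection:
            scores[b] += conf
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical past meals reported by different, anonymous users.
meals = [
    {"bread", "butter", "tea"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"tea", "milk"},
    {"bread", "butter", "jam"},
]

pair_conf = build_pair_confidence(meals)
suggestions = recommend(pair_conf, {"bread"}, top_n=2)
print(suggestions)
```

Nothing in the model refers to who reported a meal, so a first-time user receives suggestions as soon as one item is selected, which is how this style of recommender sidesteps the cold-start problem.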
Integration of machine learning approaches for accelerated discovery of transition-metal dichalcogenides as Hg0 sensing materials
Integration of machine learning approaches for accelerated discovery of transition-metal dichalcogenides as Hg0 sensing materials-2019
The detrimental impact of urban airborne Hg0 from fossil fuel utilization has necessitated the discovery and development of Hg0 sensing materials for effective detection and mitigation of the pollutant. Earlier studies have hypothetically and experimentally supported that 2-dimensional transition-metal dichalcogenides (2D TMDCs), particularly MoS2, have excellent performance for Hg0 removal. However, the potential of other TMDCs for Hg0 sensor application is yet to be investigated. In this study, a total of 28 transition metals within periods 4–6 of the periodic table, excluding the lanthanide series, were examined. To ensure proper data management flow, a high-throughput data mining approach with integrated machine learning and cheminformatics simulation approaches is developed. The systemic approach integrates the Pymatgen, Factsage, Aflow and density functional theory simulation tools for accelerated discovery of suitable TMDCs from raw data via the chemical vapour reaction route. Predicted results showed that TiS2, NiS2, ZrS2, MoS2, PdS2 and WS2 exhibited TMDC characteristics. Furthermore, first-principles calculation shows that Hg-uptake capacity is in the order NiS2 > PdS2 > TiS2 > ZrS2 > WS2 > MoS2, while Hg sensing response is in the order PdS2 > MoS2 > WS2 > ZrS2 > NiS2 > TiS2. Accordingly, PdS2 is indicated to be the most suitable TMDC for airborne Hg0 sensor application. The proposed systemic approach is an initial platform for materials discovery using integrated machine learning approaches and is well-suited for the screening and discovery of new materials based on component-oriented structures.
Keywords: Atmospheric Hg0 sensor | Data mining | 2D TMDCs | Machine learning | DFT