Data Mining Strategies for Real-Time Control in New York City
Data mining strategy for real-time control in New York City - 2105
The Data Mining System (DMS) at the New York City Department of Transportation (NYCDOT) mainly consists of four database systems for traffic and pedestrian/bicycle volumes, crash data, and signal timing plans, as well as the Midtown in Motion (MIM) systems, which are used as part of the NYCDOT Intelligent Transportation System (ITS) infrastructure. These database and control systems are operated by different units at NYCDOT, each as an independent database or operation system. New York City experiences heavy volumes of traffic, pedestrians, and cyclists in each Central Business District (CBD) area and along key arterial systems. There is a consistent and urgent need in New York City for real-time control to improve mobility and safety for all users of the street network and to provide timely response to, and management of, random incidents. It is therefore necessary to develop an integrated DMS for effective real-time control and active transportation management (ATM) in New York City. This paper presents new strategies for developing an efficient and cost-effective DMS for New York City, involving: 1) use of new technology applications, such as tablets and smartphones with Global Positioning System (GPS) and wireless communication features, for data collection and reduction; 2) interface development among existing database and control systems; and 3) integrated DMS deployment with macroscopic and mesoscopic simulation models in Manhattan. The paper also suggests a complete data mining process for real-time control that combines traditional static data, current real-time data from loop detectors, microwave sensors, and video cameras, and new real-time GPS data. GPS data, including taxi and bus GPS information and data from smartphone applications, can be obtained in all weather conditions and at any time of day. The use of GPS data and smartphone applications in the NYCDOT DMS is discussed herein as a new concept. © 2014 The Authors. Published by Elsevier B.V.
Selection and peer-review under responsibility of Elhadi M. Shakshu
Keywords: Data Mining System (DMS) | New York City | real-time control | active transportation management (ATM) | GPS data
Feature based classification of voice based biometric data through Machine learning algorithm
Feature-based classification of voice-based biometric data through a machine learning algorithm - 2021
In the era of big data and growing artificial intelligence, the need for biometric identification is increasing rapidly. Digitalization and the recent pandemic crisis have boosted the need for authorized identification, which biometric identification fulfills. Our paper focuses on this concept by checking the identification accuracy of the machine learning algorithm REPTree on a selected biometric dataset, deployed and evaluated in the data mining tool WEKA. Our target is an accuracy of at least 95 percent in classifying the given sample data into the values of our target variable, i.e., male or female. The selected algorithm, REPTree, is a decision tree classification algorithm that works on the same concept as C4.5 and other decision tree algorithms, with the special ability to generate both kinds of output, i.e., discrete and continuous. This choice of algorithm gives us the benefit of higher accuracy, and selecting the dataset also becomes easy after some required modification and pre-processing of the data with dimension reduction filters. © 2021 Elsevier Ltd. All rights reserved. Selection and peer-review under responsibility of the scientific committee of the 1st International Conference on Computations in Materials and Applied Engineering – 2021.
Keywords: Prediction | Biometric data | Voice samples | Male | Female | Cost complexity pruning (CCP) | Dimension reduction
Towards a pragmatic detection of unreliable accounts on social networks
Towards practical detection of unreliable accounts on social networks - 2021
In recent years, the problem of unreliable content in social networks has become a major threat, with a proven real-world impact in events like elections and pandemics, undermining democracy and trust in science, respectively. Research in this domain has focused not only on the content but also on the accounts that propagate it, with the bot detection task having been thoroughly studied. However, not all bot accounts work as unreliable content spreaders (e.g., bots for news aggregation), and not all human accounts are necessarily reliable. In this study, we try to distinguish unreliable from reliable accounts, independently of how they are operated. In addition, we work towards providing a methodology capable of coping with real-world situations by introducing the content available (restricted to volume- and time-based batches) as a parameter of the methodology. Experiments conducted on a validation set with a different number of tweets per account provide evidence that our proposed solution produces an increase of up to 20% in performance when compared with traditional (individual) models and with cross-batch models (which perform better with different batches of tweets).
Keywords: Unreliable accounts detection | Social networks | Machine learning | Data mining | Volume and time adaptive methodology
A fuzzy based hybrid decision framework to circularity in dairy supply chains through big data solutions
A fuzzy-based hybrid decision framework for circularity in dairy supply chains through big data solutions - 2021
This study determines the potential barriers to achieving circularity in dairy supply chains and proposes a framework of big-data-driven solutions to deal with those barriers. The main contribution of the study is a framework that matches big data solutions to the barriers to circularity in dairy supply chains and ranks them. This framework further offers a specific roadmap as a practical contribution while investigating companies with restricted resources. In this study the main barriers are classified as ‘economic’, ‘environmental’, ‘social and legal’, ‘technological’, ‘supply chain management’ and ‘strategic’, with twenty-seven sub-barriers. Various big data solutions such as machine learning, optimization, data mining, cloud computing, artificial neural network, statistical techniques and social network analysis have been suggested. Big data solutions are matched with circularity-focused barriers to show which solutions succeed in overcoming which barriers. A hybrid decision framework based on the fuzzy ANP and the fuzzy VIKOR is developed to find the weights of the barriers and to rank the big-data-driven solutions. The results indicate that among the main barriers, ‘economic’ was of the highest importance, followed by ‘technological’, ‘environmental’, ‘strategic’, ‘supply chain management’ and then ‘social and legal’ in dairy supply chains. For overcoming circularity-focused barriers, ‘optimization’ is determined to be the most important big data solution. The other solutions for overcoming the proposed challenges are, in order, ‘data mining’, ‘machine learning’, ‘statistical techniques’ and ‘artificial neural network’. The suggested big data solutions will be useful for policy makers and managers in dealing with potential barriers to implementing circularity in the context of dairy supply chains.
Keywords: Dairy supply chain | Barriers | Circular economy | Big data solution | Fuzzy ANP - VIKOR | Group decision making system
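The ranking step described in the abstract above can be illustrated with a small sketch. It shows the classical crisp VIKOR calculation, not the fuzzy variant the study uses, and it assumes the criterion weights (which the study derives with fuzzy ANP) are already given. The function name, the score matrix layout, and the compromise parameter v = 0.5 are assumptions made for illustration only.

```python
def vikor(matrix, weights, v=0.5):
    """Crisp VIKOR sketch (assumption: higher criterion scores are better).

    matrix[j][i] is the score of alternative j on criterion i.
    weights[i] is the importance of criterion i (assumed to sum to 1).
    Returns one Q value per alternative; a lower Q means a better rank.
    """
    n_crit = len(weights)
    best = [max(row[i] for row in matrix) for i in range(n_crit)]
    worst = [min(row[i] for row in matrix) for i in range(n_crit)]

    group_utility, individual_regret = [], []
    for row in matrix:
        # Weighted, normalized distance from the best value on each criterion.
        terms = [
            weights[i] * (best[i] - row[i]) / (best[i] - worst[i])
            if best[i] != worst[i] else 0.0
            for i in range(n_crit)
        ]
        group_utility.append(sum(terms))      # S_j
        individual_regret.append(max(terms))  # R_j

    def scale(x, lo, hi):
        # Min-max scaling; returns 0 when all alternatives tie.
        return (x - lo) / (hi - lo) if hi != lo else 0.0

    s_lo, s_hi = min(group_utility), max(group_utility)
    r_lo, r_hi = min(individual_regret), max(individual_regret)
    return [
        v * scale(s, s_lo, s_hi) + (1 - v) * scale(r, r_lo, r_hi)
        for s, r in zip(group_utility, individual_regret)
    ]


# Hypothetical scores for three alternatives on two criteria.
q_values = vikor([[9, 8], [5, 6], [1, 2]], weights=[0.6, 0.4])
ranking = sorted(range(len(q_values)), key=lambda j: q_values[j])
```

In a fuzzy version, the scores and weights would be fuzzy numbers (for example triangular) and would be defuzzified before or after this calculation; that step is omitted here.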
An analysis of Twitter users’ long term political view migration using cross-account data mining
An analysis of the migration of Twitter users' long-term views using cross-account data - 2021
During the 2016 US presidential election, we witnessed a polarized population and an election outcome that defied the predictions of many media sources. In this study, we conducted a follow-up on political view migration by tracking Twitter users’ account activity. The study followed a set of Twitter users over a four-year period. Each year, Twitter user activities were collected and analyzed by our novel cross-account data mining algorithm. Through multiple iterations, this algorithm computes a numerical political score for each user based on their connections to other users and hashtags. We identified a set of seed users and hashtags using prominent political figures and movements to bootstrap the algorithm. The political score distribution demonstrates a population divided on political views. We also observed that users are more moderate in years close to elections (2017 and 2020) than in non-election years (2018 and 2019). There is an overall migration trend from conservative to progressive over the four years. This change in scores across the four-year time frame suggests a unique political cycle exclusive to Donald Trump’s unprecedented presidential term. In a broad sense, our results portray the potential capabilities of a data collection and scoring algorithm that detected a noticeable political migration and described the broad social characteristics of certain politically aligned users on social media platforms.
Keywords: Social networks | Politics | Twitter | Data mining
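The abstract above describes an iterative algorithm that scores users from their connections to seed users and hashtags. The sketch below is one simple way such a scheme could work, not the authors' algorithm: seed nodes keep a fixed score, and every other node repeatedly takes the average score of its neighbors. The score range of -1 to 1, the averaging rule, and all names are assumptions.

```python
from collections import defaultdict


def propagate_scores(edges, seeds, iterations=10):
    """Iterative score propagation over a user/hashtag graph (illustrative).

    edges: iterable of (node_a, node_b) pairs, e.g. (user, hashtag) or
           (user, user); connections are treated as undirected.
    seeds: dict mapping seed nodes to fixed scores (assumed range -1..1).
    Returns a dict mapping every node to its final score.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Non-seed nodes start neutral; seed nodes start at their fixed score.
    scores = {node: seeds.get(node, 0.0) for node in neighbors}
    for node, value in seeds.items():
        scores.setdefault(node, value)

    for _ in range(iterations):
        updated = {}
        for node in scores:
            if node in seeds:
                updated[node] = seeds[node]  # seeds never change
                continue
            linked = neighbors.get(node, ())
            # Average of neighbor scores from the previous iteration.
            updated[node] = (
                sum(scores[n] for n in linked) / len(linked) if linked else 0.0
            )
        scores = updated
    return scores


# Hypothetical toy graph with two seed hashtags.
example = propagate_scores(
    edges=[("alice", "#left"), ("bob", "#right"),
           ("carol", "alice"), ("carol", "bob")],
    seeds={"#left": -1.0, "#right": 1.0},
)
```

Running the same procedure on each year's activity and comparing a user's score across years would give a migration measure of the kind the abstract discusses.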
Data-driven detection and characterization of communities of accounts collaborating in MOOCs
Data-driven identification and characterization of communities of accounts collaborating in MOOCs - 2021
Collaboration is considered one of the main drivers of learning, and it has been broadly studied across numerous contexts, including Massive Open Online Courses (MOOCs). Research on MOOCs has risen exponentially in recent years, and a number of works have focused on studying collaboration. However, these previous studies have been restricted to the analysis of collaboration based on forum and social interactions, without taking into account other possibilities such as synchronicity in the interactions with the platform. Therefore, in this work we performed a case study with the goal of implementing a data-driven approach to detect and characterize collaboration in MOOCs. We developed an algorithm that detects synchronicity links between accounts based on their quiz submission times, as an indicator of collaboration, and applied it to data from two large Coursera MOOCs. We found three different profiles of user accounts, which were grouped in couples and larger communities exhibiting different types of associations between user accounts. The characterization of these user accounts suggested that some of them might represent genuine online collaborative learning associations, but that in other cases dishonest behaviors such as free-riding or multiple-account cheating might be present. These findings call for additional research on the kinds of collaboration that can emerge in online settings.
Keywords: Learning analytics | Educational data mining | Collaborative learning | Massive open online courses | Artificial intelligence
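The synchronicity idea in the abstract above can be sketched as follows. This is an illustrative simplification, not the algorithm from the study: two accounts are linked when enough of their shared quizzes were submitted within a short time window of each other, and linked accounts are then grouped into couples or larger communities as connected components. The window length, the minimum number of close submissions, and all names are assumptions.

```python
from itertools import combinations


def synchronicity_links(submissions, window_seconds=60, min_close=3):
    """Link accounts whose quiz submissions are close in time.

    submissions: dict account -> dict quiz_id -> submission time (seconds).
    Returns a list of (account_a, account_b) pairs.
    """
    links = []
    for a, b in combinations(sorted(submissions), 2):
        shared = submissions[a].keys() & submissions[b].keys()
        close = sum(
            1 for quiz in shared
            if abs(submissions[a][quiz] - submissions[b][quiz]) <= window_seconds
        )
        if close >= min_close:
            links.append((a, b))
    return links


def communities(links):
    """Group linked accounts into connected components (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical data: u1 and u2 submit within a minute of each other.
data = {
    "u1": {"q1": 100, "q2": 500, "q3": 900},
    "u2": {"q1": 130, "q2": 520, "q3": 910},
    "u3": {"q1": 5000, "q2": 9000, "q3": 20000},
}
found = communities(synchronicity_links(data))
```

A link found this way only indicates synchronicity; as the abstract notes, telling genuine collaboration from free-riding or multiple-account use needs further characterization of the accounts.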
The use of big data and data mining in nurse practitioner clinical education
The use of big data and data mining in the clinical education of practitioners - 2020
Nurse practitioner (NP) faculty have not fully used data collected in NP clinical education for data mining. With current advances in database technology, including data storage and computing power, NP faculty have an opportunity to mine the enormous amounts of clinical data documented by NP students in electronic clinical management systems. The purpose of this project was to examine the use of big data and data mining in NP clinical education and to establish a foundation for competency-based education. Using a data mining knowledge discovery process, faculty are able to gain an increased understanding of clinical practicum experiences to inform competency-based NP education and the future use of entrustable professional activities.
Keywords: Big data | Data mining | Nurse practitioner clinical education | Competency-based education | Nurse Practitioner Core Competencies | Entrustable professional activities
Data mining of customer choice behavior in internet of things within relationship network
Data mining of customer choice behavior in the Internet of Things within a relationship network - 2020
The Internet of Things has changed the relationships within traditional customer networks and has affected traditional information dissemination. Smart environments accelerate changes in customer behavior. The new customer relationship network, which benefits from Internet of Things technology, will imperceptibly influence customer choice behavior through cyber intelligence. In this work, we selected the click and browsing records of 298 customers as training data, took 50 customers who were using the platform for the first time as research objects, and used the smart customer relationship network corresponding to cyber intelligence to build a customer intelligence decision model for the Internet of Things. The results showed that the MAE (Mean Absolute Deviation) of the customer trust evaluation model constructed in this study is 0.215, a 45% improvement over the traditional equal assignment method. In addition, the customer's consumption experience can be enhanced with the support of data mining technology in cyber intelligence. Our work indicates that the key to eliminating confusion in the customer choice behavior mechanism is to establish a consumer-centric, effective network of customers and service providers, supported by the Internet of Things, big data analysis, and relational fusion technologies.
Keywords: Internet of things | Customer relationship network | Decision making | Recommendation | Fusion algorithm
Data mining and application of ship impact spectrum acceleration based on PNN neural network
Data mining and application of ship impact spectrum acceleration based on a PNN neural network - 2020
The choice of smoothing coefficient in a probabilistic neural network directly affects the performance of the network. Traditionally, all pattern layer neurons use a uniform smoothing coefficient, and the optimal smoothing parameter for the problem is then found by an optimization algorithm. In this study, the smoothing coefficients of the pattern layer neurons connected to the same summation layer neuron are set to the same value, which not only reflects the relationship between training samples of the same pattern but also highlights the difference between training samples of different patterns. Two probabilistic neural network models are each applied to ship impact environment prediction. The results show that the network with multiple smoothing factors classifies better than the network with a single smoothing factor.
Keywords: Ship impact environment prediction | Probabilistic neural network | Smoothing coefficient | Optimization algorithm
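The difference between a single smoothing coefficient and one coefficient per class, as described in the abstract above, can be shown with a minimal probabilistic neural network sketch. It uses a Gaussian kernel and equal class priors, both of which are assumptions; the function name and data layout are hypothetical, and no optimization of the coefficients is included.

```python
import math


def pnn_predict(x, training, sigmas):
    """Classify x with a probabilistic neural network (illustrative sketch).

    training: dict label -> list of feature vectors (the pattern layer).
    sigmas:   a single float shared by all pattern neurons, or a dict
              label -> float giving one smoothing coefficient per class.
    Returns the label with the highest estimated class density at x.
    """
    dim = len(x)
    best_label, best_density = None, -1.0
    for label, samples in training.items():
        sigma = sigmas[label] if isinstance(sigmas, dict) else sigmas
        # Gaussian kernel normalization, needed when sigma differs by class.
        norm = (2 * math.pi) ** (dim / 2) * sigma ** dim
        total = 0.0
        for sample in samples:
            dist_sq = sum((xi - si) ** 2 for xi, si in zip(x, sample))
            total += math.exp(-dist_sq / (2 * sigma ** 2))
        # Summation layer: average kernel response for this class.
        density = total / (len(samples) * norm)
        if density > best_density:
            best_label, best_density = label, density
    return best_label


# Hypothetical one-dimensional training data with two classes.
train = {"low": [[0.0], [0.2]], "high": [[1.0], [1.2]]}
shared = pnn_predict([0.1], train, 0.3)                        # one sigma
per_class = pnn_predict([1.1], train, {"low": 0.2, "high": 0.4})  # one per class
```

In practice the coefficients would be tuned, for example by a search that minimizes validation error, which is the role the abstract assigns to the optimization algorithm.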
Pivot-based approximate k-NN similarity joins for big high-dimensional data
Pivot-based approximate k-NN similarity joins for big high-dimensional data - 2020
Given an appropriate similarity model, the k-nearest neighbor similarity join represents a useful yet costly operator for data mining, data analysis and data exploration applications. The time to evaluate the operator depends on the size of the datasets, the data distribution and the dimensionality of the data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the joins practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems, Apache Hadoop and Spark. Focusing on the metric space approach relying on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and also presents the empirical performance evaluation, approximation precision, and scalability properties of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. Key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision and execution time.
Keywords: Hadoop | Spark | MapReduce | k-NN | Approximate similarity join | High-dimensional data
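The pivot-based idea in the abstract above can be sketched on a single machine. The example below is a simplified illustration, not one of the algorithms compared in the paper and not MapReduce code: pivots are drawn at random from the reference dataset, every object is assigned to its nearest pivot, and each query object searches only the cells of its closest pivots. Euclidean distance, the parameter names, and the default values are assumptions.

```python
import heapq
import math
import random


def nearest_pivots(obj, pivots, count):
    """Indices of the `count` pivots closest to obj (Euclidean distance)."""
    return heapq.nsmallest(
        count, range(len(pivots)), key=lambda i: math.dist(obj, pivots[i])
    )


def approx_knn_join(queries, reference, k, num_pivots=4, probes=1, seed=0):
    """Approximate k-NN join of `queries` against `reference`.

    queries, reference: lists of equal-length numeric tuples.
    probes: how many nearest pivot cells each query searches; searching
            all cells (probes == num_pivots) makes the result exact.
    Returns dict query index -> list of up to k nearest reference objects.
    """
    rng = random.Random(seed)
    pivots = rng.sample(reference, min(num_pivots, len(reference)))

    # Partition the reference data: each object goes to its nearest pivot.
    cells = {}
    for obj in reference:
        cells.setdefault(nearest_pivots(obj, pivots, 1)[0], []).append(obj)

    result = {}
    for index, query in enumerate(queries):
        candidates = []
        for pivot_index in nearest_pivots(query, pivots, probes):
            candidates.extend(cells.get(pivot_index, []))
        # Exact k-NN, but only within the probed cells.
        result[index] = heapq.nsmallest(
            k, candidates, key=lambda obj: math.dist(query, obj)
        )
    return result


# Hypothetical two-dimensional data forming two clusters.
ref = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
joined = approx_knn_join([(0.0, 0.1)], ref, k=2, num_pivots=2, probes=2)
```

In a distributed setting each pivot cell would typically map to a partition processed by one worker; lowering `probes` reduces the work per query at the cost of precision, which mirrors the trade-off between approximation quality and execution time noted in the abstract.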