Data Mining Strategies for Real-Time Control in New York City
Data mining strategy for real-time control in New York City-2105
The Data Mining System (DMS) at the New York City Department of Transportation (NYCDOT) mainly consists of four database systems for traffic and pedestrian/bicycle volumes, crash data, and signal timing plans, as well as the Midtown in Motion (MIM) systems, which are used as part of the NYCDOT Intelligent Transportation System (ITS) infrastructure. These database and control systems are operated by different units at NYCDOT, each as an independent database or operation system. New York City experiences heavy volumes of traffic, pedestrians, and cyclists in each Central Business District (CBD) area and along key arterial systems. There are consistent and urgent needs in New York City for real-time control to improve mobility and safety for all users of the street networks, and to provide timely response to and management of random incidents. Therefore, it is necessary to develop an integrated DMS for effective real-time control and active transportation management (ATM) in New York City. This paper presents new strategies for New York City for developing an efficient and cost-effective DMS, involving: 1) use of new technology applications such as tablets and smartphones with Global Positioning System (GPS) and wireless communication features for data collection and reduction; 2) interface development among existing database and control systems; and 3) integrated DMS deployment with macroscopic and mesoscopic simulation models in Manhattan. The paper also suggests a complete data mining process for real-time control with traditional static data, current real-time data from loop detectors, microwave sensors, and video cameras, and new real-time data from GPS. GPS data, including taxi and bus GPS information and smartphone applications, can be obtained in all weather conditions and at any time of day. The use of GPS data and smartphone applications in the NYCDOT DMS is discussed herein as a new concept. © 2014 The Authors. Published by Elsevier B.V.
Selection and peer-review under responsibility of Elhadi M. Shakshu
Keywords: Data Mining System (DMS) | New York City | real-time control | active transportation management (ATM) | GPS data
Quantile regression in big data: A divide and conquer based strategy
Quantile regression in big data: a strategy based on divide and conquer-2020
Quantile regression, which analyzes the conditional distribution of outcomes given a set of covariates, has been widely used in many fields. However, the volume and velocity of big data make estimating a quantile regression model extremely difficult because of the intensive computation required and the limited storage available. Based on a divide-and-conquer strategy, a simple and efficient method is proposed to address this problem. The proposed approach keeps only summary statistics of each data block and then uses them to reconstruct the estimator for the entire data set with asymptotically negligible approximation error. This property makes the proposed method particularly appealing when data blocks are retained on multiple servers or arrive as a data stream. Furthermore, the proposed estimator is shown to be consistent and asymptotically as efficient as the estimating equation estimator calculated using the entire data set when certain conditions hold. The merits of the proposed method are illustrated using both simulation studies and real data analysis.
Keywords: Data stream | Divide and conquer | Estimating equation | Massive data sets | Quantile regression
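The divide-and-conquer idea can be illustrated with a minimal Python sketch. It is not the estimator proposed in the paper, whose summary statistics are not given in the abstract. As an assumption for illustration, it uses the simplest variant: fit a one-covariate quantile regression line on each block, keep only the block size and the two fitted coefficients, and average them with weights proportional to block size. The function names (`check_loss`, `fit_block`, `combine_blocks`) are hypothetical. The block-level fit relies on the standard property that some optimal quantile regression line passes through two data points, so it enumerates pairs and is only suitable for small blocks.

```python
from itertools import combinations


def check_loss(residual, tau):
    """Quantile (check) loss: u * (tau - 1[u < 0])."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def fit_block(xs, ys, tau):
    """Fit y = intercept + slope * x at quantile level tau on one small block.

    Enumerates every line through two data points and keeps the one with
    the smallest total check loss (cubic cost in the block size).
    """
    points = list(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line is not a valid regression line
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = sum(check_loss(y - intercept - slope * x, tau) for x, y in points)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("block needs at least two distinct x values")
    return best[1], best[2]


def combine_blocks(blocks, tau):
    """Size-weighted average of block-level fits.

    `blocks` is any iterable of (xs, ys) pairs, so it can be a stream;
    only three running totals are kept in memory between blocks.
    """
    total_n = 0
    sum_intercept = 0.0
    sum_slope = 0.0
    for xs, ys in blocks:
        intercept, slope = fit_block(xs, ys, tau)
        n = len(xs)
        total_n += n
        sum_intercept += n * intercept
        sum_slope += n * slope
    return sum_intercept / total_n, sum_slope / total_n
```

Because `combine_blocks` consumes the blocks one at a time, the raw data of a block can be discarded as soon as its coefficients have been added to the running totals, which is the storage saving the abstract refers to.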
Bias reduction in the population size estimation of large data sets
Bias reduction in the population size estimation of large data sets-2020
Estimation of the population size of large data sets and hard-to-reach populations can be a significant problem. For example, in the military, manpower is limited and the manual processing of large data sets can be time-consuming. In addition, accessing the full population of data may be restricted by factors such as cost, time, and safety. Four new population size estimators are proposed, as extensions of existing methods, and their performances are compared in terms of bias with two existing methods in the big data literature. The new estimators would be particularly beneficial in the context of time-critical decisions or actions. The comparison is based on a simulation study and the application to five real network data sets (Twitter, LiveJournal, Pokec, Youtube, Wikipedia Talk). Whilst no single estimator (out of the four proposed) generates the most accurate estimates overall, the proposed estimators are shown to produce more accurate population size estimates for small sample sizes, but in some cases show more variability than existing estimators in the literature.
Keywords: Relative bias | Twitter | Size estimator | Youtube | Random walk sampling
Refined composite multivariate multiscale symbolic dynamic entropy and its application to fault diagnosis of rotating machine
Refined composite multivariate symbolic dynamic entropy and its application to fault diagnosis of rotating machines-2020
Accurate and efficient identification of various fault categories, especially for big data and multisensor systems, is a challenge in rotating machinery fault diagnosis. For diagnosis problems with massive multivariate data, extracting discriminative and stable features with high efficiency is a crucial step. This paper proposes a novel feature extraction method, called Refined Composite multivariate Multiscale Symbolic Dynamic Entropy (RCmvMSDE), based on refined composite analysis and multivariate multiscale symbolic dynamic entropy. Specifically, multivariate multiscale symbolic dynamic entropy can capture more identification information from multiple sensors with superior computational efficiency, while refined composite analysis guarantees its stability. The ability of the proposed method to measure the complexity of multivariate time series and to identify signals with different components is discussed on the basis of simulation analysis. Further, to verify the effectiveness of the proposed method on fault diagnosis tasks, a centrifugal pump dataset under a constant speed condition and a ball bearing dataset under a time-varying speed condition are used. Compared with existing methods, the proposed method improves the classification accuracy and F-score to 99.81% and 0.9981, respectively. Meanwhile, the proposed method saves at least half of the computational time. The results show that the proposed method is effective in improving efficiency and classification accuracy when dealing with massive multivariate signals.
Keywords: Multivariate multiscale symbolic dynamic entropy | Random forest | Time-varying speed conditions | Fault diagnosis
Can the development of a patient’s condition be predicted through intelligent inquiry under the e-health business mode? Sequential feature map-based disease risk prediction upon features selected from cognitive diagnosis big data
Can a patient’s condition be predicted through intelligent inquiry under e-business conditions? Sequence-feature-based disease risk prediction on features selected from cognitive diagnosis big data-2020
The data-driven mode has promoted research in preventive medicine. In predicting disease risks, physicians’ clinical cognitive diagnosis data can be used for early prevention of diseases and, therefore, to reduce medical cost, improve the accessibility of medical services, and lower medical risk. However, previous research did not involve physicians’ cognition of patients’ conditions in intelligent inquiry under the e-health business mode, offered no diagnosis big data, neglected the value of the fused text information generated by joint activities of online and offline medical data, and failed to thoroughly analyze the phenomenon of redundancy-complementarity dispersion caused by high-order information shortage from the online inquiry data-driven perspective. Besides, risk prediction based solely on offline clinical cognitive diagnosis data undoubtedly reduces prediction precision. Importantly, relevant research rarely considered the temporal relationships of different medical events, did not analyze in detail the practical problem of pattern explosion, did not offer the idea of an intelligent portrayal map, and did not conduct risk prediction based on the sub-maps obtained from the map. Consequently, the paper presents a disease risk prediction method with a model for redundancy-complementarity dispersion-based feature selection from physicians’ online cognitive diagnosis big data, which selects features from the cognitive diagnosis big data of online intelligent inquiry; the obtained features are ranked intelligently for subsequent compensation of the high-dimensional information shortage; and the compensated key feature information of the cognitive diagnosis big data is fused with the offline electronic medical record (EMR) to form the virtual electronic medical record (VEMR).
The VEMR was combined with the sequential feature map method for modelling, and a sequential feature map-based model for disease risk prediction was presented to obtain online users’ medical conditions. A neighborhood-based collaborative prediction model was presented to predict the diseases an online intelligent medical inquiry user may develop in the future and to intelligently rank the risk probabilities of those diseases. In the experiments, the online intelligent medical inquiry users’ VEMRs were used as the foundation of simulation experiments to predict disease risks in a chronic obstructive pulmonary disease (COPD) population and a rheumatic heart disease (RHD) population. The experiments demonstrated that the presented method achieved relatively good metric performance on the VEMR and improved disease risk prediction.
Keywords: Cognitive diagnosis big data | Online intelligent inquiry | Sequential feature map | Disease risk prediction | Redundancy and complementarity dispersion
Wake modeling of wind turbines using machine learning
Modeling of wind turbines using machine learning-2020
In this paper, a novel framework that employs machine learning and CFD (computational fluid dynamics) simulation to develop new wake velocity and turbulence models with high accuracy and good efficiency is proposed to improve turbine wake predictions. An ANN (artificial neural network) model based on the backpropagation (BP) algorithm is designed to build the underlying spatial relationship between the inflow conditions and the three-dimensional wake flows. To save computational cost, a reduced-order turbine model, ADM-R (actuator disk model with rotation), is incorporated into RANS (Reynolds-averaged Navier-Stokes equations) simulations coupled with a modified k − ε turbulence model to provide big datasets of wake flow for training, testing, and validation of the ANN model. The numerical framework of RANS/ADM-R simulations is validated against a standalone Vestas V80 2 MW wind turbine and an NTNU wind tunnel test of two aligned turbines. In the ANN-based wake model, the inflow wind speed and turbulence intensity at hub height are selected as input variables, while the spatial velocity deficit and added turbulence kinetic energy (TKE) in the wake field are taken as output variables. The ANN-based wake model is first deployed to a standalone turbine, and then the spatial wake characteristics and power generation of an aligned 8-turbine row representing the Horns Rev wind farm are validated against Large Eddy Simulations (LES) and field measurements. The results of the ANN-based wake model show good agreement with the numerical simulations and measurement data, indicating that the ANN is capable of establishing the complex spatial relationship between inflow conditions and the wake flows. Machine learning techniques can remarkably improve the accuracy and efficiency of wake predictions.
Keywords: Wind turbine wake | Wake model | Artificial neural network (ANN) | Machine learning | ADM-R (actuator-disk model with rotation) model | Computational fluid dynamics (CFD)
Complementarity modeling of monthly streamflow and wind speed regimes based on a copula-entropy approach: A Brazilian case study
Complementarity modeling of monthly streamflow and wind speed regimes based on a copula-entropy approach: a Brazilian case study-2020
Wind power has been showing significant growth in installed capacity around the world. This opportunity presents major challenges for operating power systems with high wind power penetration levels, considering the variability and intermittent behavior of this type of power source. To reduce the uncertainties associated with this kind of power system, researchers have explored the integration of wind power with other renewable energy sources, like solar and hydropower. For instance, the integration of wind and hydro systems can exploit the spatial and temporal complementarity of hydrological and wind regimes to produce energy. Therefore, it is necessary to consider the stochastic behavior of, and the dependence structures between, these variables to define better operational policies. This study explores the spatial correlation of hydrological and wind regimes in different regions of Brazil and defines an entropy-copula-based model for the joint simulation of monthly streamflow and wind speed time series to evaluate the potential integration of hydro and wind energy sources. The proposed model showed good adherence to the periodic behavior of both variables, and the results indicate that the simulated scenarios preserved the statistical features of the historical data.
Keywords: Hydro-wind complementary | Renewable energy | Stochastic modeling
Review of methods used to estimate the sky view factor in urban street canyons
A review of methods used to estimate the sky view factor in urban street canyons-2020
The sky view factor (SVF) is the ratio of the visible sky area of a point in space to the total sky area. It provides the relationship between the visible sky area and the area covered by surroundings such as buildings or street trees. The SVF has been widely used as a key parameter in urban climate research and urban planning practices. Significant research has taken place in the past decades on methods of calculating/estimating SVFs to improve their accuracy and efficiency. This review lists the methods used to calculate/estimate SVFs, including geometric methods, fish-eye photographic methods, Global Positioning System methods, simulation methods based on 3D city models or digital surface models, and big data approaches using street view images. We stress the principles, input data, application, accuracy, and efficiency of each method. This review is relevant to climatologists in the solar radiation modeling and energy balance modeling fields, as well as to urban planners developing design guidelines to improve outdoor thermal comfort in the urban environment.
Keywords: Sky view factor | Urban street canyon | Aspect ratio | Street panoramic images | Urban planning
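As a small illustration of the geometric family of methods, the following Python sketch evaluates the textbook closed-form expression for an idealised street canyon. It is not taken from the review itself. The assumptions are loud ones: the canyon is infinitely long, both walls have the same height, and the point of interest is the centre of the canyon floor, in which case SVF = cos(β) with β = arctan(H / (W / 2)). The function name `canyon_svf_at_floor_centre` is hypothetical.

```python
import math


def canyon_svf_at_floor_centre(height, width):
    """Sky view factor at the mid-point of the floor of an idealised canyon.

    Assumes an infinitely long, symmetric canyon with wall height `height`
    and street width `width` (same length unit for both).
    """
    if width <= 0 or height < 0:
        raise ValueError("width must be positive and height non-negative")
    # beta is the elevation angle from the floor centre to the top of a wall
    beta = math.atan2(height, width / 2.0)
    return math.cos(beta)


# Aspect ratio H/W = 1 (for example, 20 m walls on a 20 m wide street)
svf = canyon_svf_at_floor_centre(20.0, 20.0)
```

The result depends only on the aspect ratio H/W, which is why that ratio appears among the keywords; real streets with irregular building heights or trees need the photographic, simulation, or street-view approaches that the review covers.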
Dynamic occupant density models of commercial buildings for urban energy simulation
Dynamic occupant density models of commercial buildings for urban energy simulation-2020
The number of occupants and its changing pattern over time are key information for building and urban energy simulation. However, the commonly used assumption and simplification of a fixed occupancy schedule does not reflect the complicated reality, leading to significant errors in energy simulation. Therefore, dynamic occupant density models that describe the real-world situation more accurately should be developed. This paper presents a methodology to develop such a model for commercial buildings and expand it from the building level to the urban level. First, a total of 2275 commercial buildings in Nanjing, a major city in China, are identified and classified into three sub-categories using Points of Interest and logistic regression. Then field measurement is conducted to obtain the hourly occupant density for 12 sample commercial buildings. The building-level dynamic occupant density model is developed by fitting normal distribution functions to the measured data. Finally, transportation accessibility and population level, two urban parameters, are defined and used to expand the building-level occupant density model to the urban-level one. The dynamic urban-level occupant density model is verified for all three sub-categories of commercial buildings and the overall results are acceptable.
Keywords: Big data | Commercial buildings | Urban-level | Dynamic occupant density models
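The curve-fitting step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the paper's fitting procedure, which the abstract does not detail. It assumes a single daily peak, equally spaced hourly readings, and a method-of-moments fit that treats the measured densities as weights; the function names `fit_normal_curve` and `normal_curve` and any input values are hypothetical.

```python
import math


def normal_curve(hour, mean, sigma, peak):
    """Bell-shaped occupant density profile evaluated at a given hour."""
    return peak * math.exp(-0.5 * ((hour - mean) / sigma) ** 2)


def fit_normal_curve(hours, densities):
    """Method-of-moments fit of one bell curve to hourly densities.

    Returns (mean, sigma, peak). Assumes equally spaced hours and
    non-negative densities with a single dominant peak.
    """
    total = sum(densities)
    if total <= 0:
        raise ValueError("densities must have a positive sum")
    mean = sum(h * d for h, d in zip(hours, densities)) / total
    variance = sum(d * (h - mean) ** 2 for h, d in zip(hours, densities)) / total
    if variance <= 0:
        raise ValueError("densities are concentrated in a single hour")
    sigma = math.sqrt(variance)
    step = hours[1] - hours[0]
    # The area under the curve is peak * sigma * sqrt(2 * pi)
    peak = total * step / (sigma * math.sqrt(2.0 * math.pi))
    return mean, sigma, peak
```

A building with separate morning and afternoon peaks would need a sum of several such curves and a least-squares fit instead of moments; the sketch only shows how a measured hourly profile is reduced to a few curve parameters.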
Aggregation of inputs and outputs prior to Data Envelopment Analysis under big data
Aggregation of inputs and outputs prior to data envelopment analysis under big data-2020
The main goal of this paper is to explore possible solutions to a ‘big data’ problem related to the very large dimensions of input–output data. In particular, we focus on cases of a severe ‘curse of dimensionality’ problem that require dimension reduction prior to using Data Envelopment Analysis. To achieve this goal, we present some theoretical grounds and perform a simulation study, new to the literature, in which we explore price-based aggregation as a solution to the problem of very large dimensions.
Keywords: Data Envelopment Analysis | Productivity | Efficiency | Big data
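Price-based aggregation can be illustrated with a minimal Python sketch. It is a simplified illustration, not the procedure studied in the paper: it assumes every unit faces the same known prices, collapses all inputs into one cost figure and all outputs into one revenue figure, and then uses the fact that with a single aggregate input, a single aggregate output, and constant returns to scale, the DEA efficiency score reduces to each unit's output/input ratio divided by the best observed ratio. The function names and the numbers in the usage example are hypothetical.

```python
def aggregate(quantities, prices):
    """Collapse each unit's quantity vector into one price-weighted value."""
    return [sum(p * q for p, q in zip(prices, row)) for row in quantities]


def crs_efficiency(inputs, input_prices, outputs, output_prices):
    """Constant-returns-to-scale efficiency after price-based aggregation.

    `inputs` and `outputs` hold one row of quantities per decision-making
    unit. Returns one score in (0, 1] per unit; 1 marks the best performer.
    """
    cost = aggregate(inputs, input_prices)
    revenue = aggregate(outputs, output_prices)
    productivity = [r / c for r, c in zip(revenue, cost)]
    best = max(productivity)
    return [p / best for p in productivity]


# Three units, two inputs priced at 1 and 2, one output priced at 2
scores = crs_efficiency(
    inputs=[[2, 4], [1, 1], [4, 2]],
    input_prices=[1, 2],
    outputs=[[5], [3], [4]],
    output_prices=[2],
)
```

The aggregation replaces a model with many input and output dimensions by one with two, which is the dimension reduction the abstract describes; whether the resulting scores stay close to those of the full-dimensional model is the question the paper's simulation study addresses.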