What’s in the box?! Towards explainable machine learning applied to non-residential building smart meter classification (2019)
Feature engineering and data-driven classification models are at the forefront of the analysis of large temporal sensor data from the built environment. In previous efforts, temporal features were engineered from whole-building hourly electrical meter data from 507 non-residential buildings. These features fall within three general categories (statistics-, model-, and pattern-based) and can be used to identify various behaviors in the structure of whole-building electrical meter data. In this paper, a deeper investigation is made of exactly what types of behavior are most important in the context of two classification scenarios: the primary use of a building and the level of performance of a building compared to its peers. The highly comparative time-series analysis (hctsa) toolkit is used to analyze the most important temporal features for the classification of various building performance attributes. In the first analysis, a comparison is made to distinguish the behavior of university dormitories (70 buildings) from laboratories (95 buildings) as an example of interpreting the classification of the primary-use type of a building. In the second analysis, a comparison of buildings with high (165 buildings) versus low (169 buildings) consumption is used to extract and understand the behavior that indicates the level of energy performance of a building. These two case-study examples provide a foundation for further explainable machine learning techniques in both classification and prediction as applied to buildings. This effort is the first example of machine learning with an explicit focus on the interpretability of classification for smart meter data from non-residential buildings.
Keywords: Interpretable machine learning | Explainable machine learning | Building performance analysis | Performance classification | Energy efficiency | Smart meter | Temporal feature engineering | Load clustering | Data science | Customer segmentation | Time-series analysis
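As a hedged illustration of the statistics- and pattern-based temporal features this abstract refers to, the sketch below computes a few such features from a synthetic hourly meter series. The feature names, the function, and the synthetic load profile are assumptions for illustration only; they are not taken from the paper or from the hctsa toolkit:

```python
import numpy as np

def extract_temporal_features(hourly):
    """Compute simple statistics- and pattern-based features
    from an hourly whole-building electrical meter series."""
    hourly = np.asarray(hourly, dtype=float)
    mean = hourly.mean()
    return {
        "mean": mean,
        "std": hourly.std(),
        # statistics-based: ratio of peak demand to average demand
        "peak_to_mean": hourly.max() / mean,
        # pattern-based: lag-24 autocorrelation, i.e. strength of the daily cycle
        "lag24_autocorr": float(np.corrcoef(hourly[:-24], hourly[24:])[0, 1]),
    }

# Synthetic two-week load with a clean daily cycle (assumed profile)
t = np.arange(24 * 14)
load = 10.0 + 3.0 * np.sin(2 * np.pi * t / 24)
feats = extract_temporal_features(load)
```

For a strictly periodic series the lag-24 autocorrelation is close to 1; real meter data would mix weekday/weekend and occupancy effects, which is why the paper distinguishes statistics-, model-, and pattern-based feature families.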
Setting up standards: A methodological proposal for pediatric Triage machine learning model construction based on clinical outcomes (2019)
Triage is a critical process in hospital emergency departments (ED). Specifically, we consider how to achieve fast and accurate patient Triage in the ED of a pediatric hospital. The goal of this paper is to establish methodological best practices for the application of machine learning (ML) to Triage in the pediatric ED, providing a comprehensive comparison of the performance of ML techniques over a large dataset. Our work is among the first attempts in this direction. Following very recent works in the literature, we use the clinical outcome of a case as its label for supervised ML model training, instead of the more uncertain labels provided by experts. The experimental dataset contains the records from three years of operation of the hospital ED, consisting of 189,718 patient visits to the hospital. The clinical outcome of 9271 cases (4.98%) was hospital admission; our dataset is therefore highly class-imbalanced. Our reported performance comparison focuses on four ML models: Deep Learning (DL), Random Forest (RF), Naive Bayes (NB) and Support Vector Machines (SVM). Data preprocessing includes class imbalance correction and case re-labeling. We use several well-known metrics to evaluate the performance of the ML models in three different experimental settings: (a) classification of each case into the standard five Triage urgency levels, (b) discrimination of high versus low case severity according to its clinical outcome, and (c) comparison of the number of patients assigned to each standard Triage urgency level against the rule-based Triage expert system currently in use at the hospital. RF achieved greater AUC, accuracy, PPV and specificity than the other models in the dichotomous classification experiments. On the implementation side, our study shows that ML predictive models trained on clinical outcomes provide better Triage performance than the current rule-based expert system in operation at the hospital.
Keywords: Machine learning | Emergency department | Triage | Data science | Clinical decision support systems
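A minimal sketch of the kind of pipeline this abstract describes: training a Random Forest on a highly class-imbalanced dichotomous outcome and evaluating it with AUC. The synthetic dataset and the use of `class_weight="balanced"` as the imbalance correction are assumptions for illustration; they do not reproduce the paper's actual records or preprocessing:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for triage records: ~5% positive (admission) class,
# mimicking the paper's reported 4.98% admission rate.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

# class_weight="balanced" is one simple imbalance correction;
# the paper additionally re-labels cases by their clinical outcome.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=0)
rf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
```

AUC is a sensible headline metric here because accuracy alone is misleading when 95% of cases belong to one class.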
Environmental data stream mining through a case-based stochastic learning approach (2018)
Environmental data stream mining is an open challenge for Data Science. Commonly used methods are static: they analyze a static set of data and provide static data-driven models. Environmental systems, however, are dynamic and generate a continuous data stream, so Data Science must provide dynamic methods that cope with the temporal nature of the data. Our proposal is to model each environmental information unit, as it is generated over time, as a new case/experience in a Case-Based Reasoning (CBR) system. This contribution aims to incrementally build and manage a Dynamic Adaptive Case Library (DACL). In this paper, a stochastic method for learning new cases and managing prototypes to create and maintain the DACL in an incremental way is introduced. This stochastic method operates in two main stages. An evaluation of the method has been carried out using an air-quality data stream from the city of Obregón, Sonora, Mexico, with good results. In addition, other datasets have been mined to ensure the generality of the approach.
Keywords: Data science | Data stream mining | Dynamic case learning | Stochastic learning | Case-based reasoning | Air quality detection | Environmental modelling
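The idea of incrementally building a case library from a stream can be sketched with a toy nearest-prototype learner. The class and method names below are hypothetical, and the deterministic running-mean prototype update is a simplified stand-in for the paper's stochastic learning and prototype-management method:

```python
import numpy as np

class DynamicCaseLibrary:
    """Toy incremental case library: one running-mean prototype per class."""

    def __init__(self):
        self.prototypes = {}   # label -> running-mean prototype vector
        self.counts = {}       # label -> number of cases seen so far

    def add_case(self, x, label):
        """Incorporate one new case arriving from the data stream."""
        x = np.asarray(x, dtype=float)
        if label not in self.prototypes:
            self.prototypes[label] = x.copy()
            self.counts[label] = 1
        else:
            self.counts[label] += 1
            # incremental running-mean update of the class prototype
            self.prototypes[label] += (x - self.prototypes[label]) / self.counts[label]

    def predict(self, x):
        """Classify a new observation by its nearest prototype."""
        x = np.asarray(x, dtype=float)
        return min(self.prototypes,
                   key=lambda lbl: np.linalg.norm(x - self.prototypes[lbl]))

# Stream of hypothetical air-quality readings (e.g. two pollutant levels)
lib = DynamicCaseLibrary()
for reading, label in [([3, 4], "good"), ([5, 6], "good"),
                       ([40, 50], "poor"), ([42, 55], "poor")]:
    lib.add_case(reading, label)
```

Because each case updates the library as it arrives, the model adapts continuously instead of being refit on a static dataset, which is the dynamic behavior the abstract argues for.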
On the role of statistics in the era of big data: A computer science perspective (2018)
Statistics and computer science are facing remarkably similar discussions on the role of big data. In this article, I argue that the computer science community has been taking advantage of big data for about five decades, thereby building the main commercial companies of today’s computer industry, and I describe the new emphasis on data as the emergence of the so-called Fourth Paradigm. Then, I draw a parallel between the debates on big data occurring within the statistics and computer science communities; finally, I advocate a joint, new and pervasive approach to data science, in which both communities can capitalize on each other’s skills.
Keywords: Big data | Paradigm shift in statistics and computer science | Fourth paradigm | Two cultures | Teaching data science
Internet of Things and Big Data - The Disruption of the Value Chain and the Rise of New Software Ecosystems (2016)
IoT connects devices, humans, places, and even abstract items like events. Driven by smart sensors, powerful embedded microelectronics, high-speed connectivity and the standards of the internet, IoT is on the brink of disrupting today’s value chains. Big Data, characterized by high volume, high velocity and a high variety of formats, is both a result of and a driving force for IoT. The datafication of business presents completely new opportunities and risks. To hedge the technical risks posed by the interaction between “everything”, IoT requires comprehensive modelling tools. Furthermore, new IT platforms and architectures are necessary to process and store the unprecedented flow of structured and unstructured, repetitive and non-repetitive data in real time. In the end, only powerful analytics tools are able to extract “sense” from the exponentially growing amount of data, and, as a consequence, data science becomes a strategic asset.
The era of IoT relies heavily on standards for technologies that guarantee the interoperability of everything. This paper outlines some fundamental standardization activities, sketches Big Data approaches for real-time processing, and addresses tools for analytics. As a consequence, IoT is a (fast) evolutionary process whose success in penetrating all dimensions of life depends heavily on close cooperation between standardization organizations, open source communities and IT experts.
Keywords: Internet of Things | Smart Factories | Big Data | Software Platforms | Data Science
Data science, big data and granular mining (2015)
With the evolution of various modern technologies, huge amounts of data are constantly being generated and collected around us. We are in the midst of what is popularly called the information revolution and are living in a so-called world of knowledge. Intentionally and/or accidentally, the generation of these data is inevitable. As a result, large data, broadly characterised by the three Vs (large volume, velocity and variety) and popularly known as “big data”, is becoming a fashionable term, and the analysis, access and storage of these data are now central to scientific innovation, public health and welfare, public security and so on. Moreover, big data are highly complex in nature, and mining them is not straightforward. Most of the information is heterogeneous, time-varying, redundant, uncertain and imprecise. Reasoning over, understanding and mining useful knowledge from these data is becoming a great challenge. It is also true that large integrated data sets can potentially provide a much deeper understanding of both nature and society, and open up new avenues for research activities.