A dynamic classification unit for online segmentation of big data via small data buffers (2020)
In many segmentation processes, we assign new cases according to a model that was built on the basis of past cases. As long as the new cases are “similar enough” to the past cases, segmentation proceeds normally. However, when a new case is substantially different from the known cases, a reexamination of the previously created segments is required. The reexamination may result in the creation of new segments or in the updating of the existing ones. In this paper, we assume that in big and dynamic data environments it is not possible to reexamine all past data and, therefore, we suggest using small groups of selected cases, stored in small data buffers, as an alternative to the collection of all past data. We present an incremental dynamic classifier that supports real-time unsupervised segmentation in big and dynamic data environments. In order to reduce the computational effort of unsupervised clustering in such environments, the suggested model performs calculations only on the relevant data buffers that store the relevant representative cases. In addition, the suggested model can serve as a dynamic classification unit (DCU) that can act as an autonomous agent, as well as collaborate with other DCUs. The evaluation is presented by comparing three approaches: static, dynamic, and incremental dynamic.
Keywords: Incremental dynamic classifier | Dynamic segmentation | Incremental data analysis | Cluster analysis | Classification | Big data
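The buffer idea in this abstract can be illustrated with a minimal sketch. It is not the authors' DCU: the distance threshold, the fixed buffer size, the class name `BufferedSegmenter`, and the rule of recomputing a centroid from the buffer alone are all assumptions made for illustration.

```python
import math
from collections import deque


class BufferedSegmenter:
    """Toy incremental segmenter (hypothetical, for illustration only).

    Each segment keeps a centroid and a small bounded buffer of recent
    cases, so no step ever revisits the full history of past data.
    """

    def __init__(self, threshold, buffer_size=5):
        self.threshold = threshold      # assumed "similar enough" radius
        self.buffer_size = buffer_size  # assumed per-segment buffer capacity
        self.centroids = []             # one centroid per segment
        self.buffers = []               # one bounded buffer per segment

    def assign(self, case):
        """Return the segment index for `case`, opening a new segment if needed."""
        case = tuple(float(v) for v in case)
        if self.centroids:
            dists = [math.dist(case, c) for c in self.centroids]
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] <= self.threshold:
                buf = self.buffers[best]
                buf.append(case)  # the deque drops its oldest case when full
                # Update the centroid from the small buffer only.
                self.centroids[best] = tuple(
                    sum(col) / len(buf) for col in zip(*buf)
                )
                return best
        # The case differs substantially from every known segment.
        self.centroids.append(case)
        self.buffers.append(deque([case], maxlen=self.buffer_size))
        return len(self.centroids) - 1


seg = BufferedSegmenter(threshold=1.0, buffer_size=3)
labels = [seg.assign(p) for p in [(0, 0), (0.5, 0), (10, 10), (10.2, 9.9)]]
print(labels)
```

The bounded `deque` is what keeps the per-case cost independent of how much data has already been seen; a real system would also need a policy for merging or retiring segments, which this sketch omits.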
A framework for extracting urban functional regions based on multi prototype word embeddings using points-of-interest data (2020)
Many studies have sought to explore urban spatial structure, and urban functional regions have attracted increasing attention from planners, engineers and public officials. Attempts have been made to identify urban functional regions using high spatial resolution (HSR) remote sensing images and extensive geodata. However, the research scale and throughput have been limited by the accessibility of HSR remote sensing data. Recently, big geo-data have become increasingly popular for urban studies because they are accessible and objective. This study aims to build a novel framework that provides an alternative solution for sensing urban spatial structure and discovering urban functional regions, based on an emerging kind of geo-data, points of interest (POIs), and an embedding learning method from the natural language processing (NLP) field. We started by constructing the intraurban functional corpus using a center-context pairs-based approach. In the second step, a word embeddings representation model was trained on that corpus to extract multi-prototype vectors, and in the last step the functional parcels were aggregated using a spatial clustering method, hierarchical density-based spatial clustering of applications with noise (HDBSCAN). The clustering results suggest that the proposed framework is capable of discovering the utilization of urban space with a reasonable level of accuracy. The limitations and potential improvements of the proposed framework are also discussed.
Keywords: Urban functional regions | Word embeddings | Points-of-interest | Spatial clusters
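As a rough illustration of what a center-context pairs-based corpus looks like, the sketch below generates skip-gram style pairs from an ordered list of POI categories. The abstract does not say how POIs are ordered or how the window is defined, so the function name `center_context_pairs`, the `window` parameter, and the sample categories are all hypothetical.

```python
def center_context_pairs(sequence, window=2):
    """Return (center, context) pairs for every item in `sequence`.

    `sequence` is assumed to be POI categories ordered by some notion of
    spatial proximity; `window` is the assumed number of neighbours taken
    on each side of the center item.
    """
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


# Hypothetical POI categories along one street segment.
pois = ["cafe", "bank", "school"]
for pair in center_context_pairs(pois, window=1):
    print(pair)
```

Pairs of this shape are the usual training input for word-embedding models, which is why the POI sequence can be treated like a sentence.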
Managing minority opinions in micro-grid planning by a social network analysis-based large scale group decision making method with hesitant fuzzy linguistic information (2020)
The growth of global electricity demand has placed higher requirements on power distribution networks. The high cost of large-scale power systems and the call for renewable energy have driven the emergence of the micro-grid, which plays a complementary role in the power generation of large-scale power systems. Micro-grid planning is complex, and many stakeholders’ opinions should be considered for a comprehensive evaluation. Furthermore, the development of social big data techniques, such as e-marketplace and e-democracy, means that experts have social relationships with one another. This study aims to develop a consensus model to manage minority opinions for large-scale group decision making with social network analysis for micro-grid planning. To deal with the vague and uncertain features of complex micro-grid planning problems, experts are assumed to use hesitant fuzzy linguistic term sets to express their opinions. A social network analysis-based clustering method is introduced to classify experts. In a large-scale group decision making problem, the opinions of experts should be fully considered, especially the minority opinions. The model therefore considers the minority opinions in a micro-grid planning problem and provides an approach to manage them. Finally, we use an illustrative example concerning micro-grid planning decision making in Ali district in Tibet to demonstrate the effectiveness and practicability of the proposed model.
Keywords: Micro-grid planning | Large-scale group decision making | Social network analysis | Minority opinions | Hesitant fuzzy linguistic term sets | Consensus
A new fast search algorithm for exact k-nearest neighbors based on optimal triangle-inequality-based check strategy (2020)
The k-nearest neighbor (KNN) algorithm has been widely used in pattern recognition, regression, outlier detection and other data mining areas. However, it suffers from a large distance computation cost, especially in big data applications. In this paper, we propose a new fast search (FS) algorithm for exact k-nearest neighbors based on an optimal triangle-inequality-based (OTI) check strategy. While searching for the exact k-nearest neighbors of a query, the OTI check strategy can eliminate more redundant distance computations than the original TI check strategy for instances located in the marginal area of neighboring clusters. Considering the large space complexity and extra time complexity of OTI, we also propose an efficient optimal triangle-inequality-based (EOTI) check strategy. The experimental results demonstrate that our two proposed algorithms (OTI and EOTI) achieve the best performance compared with other related KNN fast search algorithms, especially on high-dimensional datasets.
Keywords: Exact k-nearest neighbors | Fast search algorithm | Clustering | Triangle inequality | Optimal check strategy
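For orientation, the sketch below shows the basic triangle-inequality (TI) check that such strategies build on; it is not the OTI or EOTI strategy from the paper. It assumes points are pre-assigned to cluster centers and that each point's distance to its center is stored. The bound used is d(q, x) >= |d(q, c) - d(x, c)|, so a point can be skipped whenever that bound is no better than the current k-th best distance. The names `build_index` and `knn_ti` are hypothetical.

```python
import heapq
import math


def build_index(points, centers):
    """Assign each point to its nearest center; store (index, distance to center)."""
    clusters = [[] for _ in centers]
    for idx, p in enumerate(points):
        dists = [math.dist(p, c) for c in centers]
        ci = min(range(len(centers)), key=dists.__getitem__)
        clusters[ci].append((idx, dists[ci]))
    return clusters


def knn_ti(query, points, centers, clusters, k):
    """Exact k-NN with a basic triangle-inequality check.

    Returns (sorted list of (distance, index), number of point distances computed).
    """
    heap = []      # max-heap via negated distances: the k best found so far
    computed = 0   # counts query-to-point distance evaluations only
    d_query_center = [math.dist(query, c) for c in centers]
    # Visit the closest clusters first so the k-th best distance shrinks early.
    for ci in sorted(range(len(centers)), key=d_query_center.__getitem__):
        d_qc = d_query_center[ci]
        for idx, d_xc in clusters[ci]:
            lower_bound = abs(d_qc - d_xc)
            if len(heap) == k and lower_bound >= -heap[0][0]:
                continue  # cannot beat the current k-th neighbour; skip it
            d = math.dist(query, points[idx])
            computed += 1
            if len(heap) < k:
                heapq.heappush(heap, (-d, idx))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, idx))
    return sorted((-nd, idx) for nd, idx in heap), computed


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (5, 5)]
centers = [(0, 0), (10, 10)]  # assumed to come from a prior clustering step
clusters = build_index(points, centers)
neighbours, computed = knn_ti((0.2, 0.1), points, centers, clusters, k=2)
print(neighbours, computed)
```

Because the check only skips points whose lower bound already rules them out, the result matches a brute-force search while evaluating fewer point distances.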
Veracity handling and instance reduction in big data using interval type-2 fuzzy sets (2020)
In the context of big data, veracity refers to the uncertainty present in a dataset. The continuous flow of unstructured data with unwanted noise may introduce abnormalities into datasets, making them unusable. In this paper, we propose a novel method to handle the veracity characteristic of big data using the concept of footprint of uncertainty (FOU) in interval type-2 fuzzy sets (IT2 FSs). The proposed method helps in handling the veracity issue in big data and reduces the instances to a manageable extent. We have compared the results with existing clustering-based methods and examined the relationship between the clusters and the FOUs by comparing their centroids and defuzzified values. To scrutinize the validity of our results, we have also performed a number of additional experiments by appending extra instances to the datasets. To check its consistency and efficacy, the proposed methodology is assessed from three different aspects. Experimental results confirm that the proposed method can suitably handle the veracity issue in big datasets and is efficient in reducing the instances.
Keywords: Instance reduction | Big data veracity | Interval type-2 fuzzy sets | Cluster centroid | Footprint of uncertainty
Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach (2020)
With the advent of Big Data, nowadays in many applications databases containing large quantities of similar time series are available. Forecasting time series in these domains with traditional univariate forecasting procedures leaves great potentials for producing accurate forecasts untapped. Recurrent neural networks (RNNs), and in particular Long Short Term Memory (LSTM) networks, have proven recently that they are able to outperform state-of-the-art univariate time series forecasting methods in this context, when trained across all available time series. However, if the time series database is heterogeneous, accuracy may degenerate, so that on the way towards fully automatic forecasting methods in this space, a notion of similarity between the time series needs to be built into the methods. To this end, we present a prediction model that can be used with different types of RNN models on subgroups of similar time series, which are identified by time series clustering techniques. We assess our proposed methodology using LSTM networks, a widely popular RNN variant, together with various clustering algorithms, such as kMeans, DBScan, Partition Around Medoids (PAM), and Snob. Our method achieves competitive results on benchmarking datasets under competition evaluation procedures. In particular, in terms of mean sMAPE accuracy it consistently outperforms the baseline LSTM model, and outperforms all other methods on the CIF2016 forecasting competition dataset.
Keywords: Big data forecasting | RNN | LSTM | Time series clustering | Neural networks
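The sMAPE metric mentioned in the abstract can be sketched as follows. Several variants of sMAPE are in use; this one takes the mean of 2|y - f| / (|y| + |f|) over the forecast horizon and is an assumption about the definition, not a statement of the exact formula used in the paper. The helper names and the sample numbers are hypothetical.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error for one series (assumed variant)."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equally long")
    total = 0.0
    for y, f in zip(actual, forecast):
        denom = abs(y) + abs(f)
        if denom == 0:
            continue  # both values are zero: count this step as zero error
        total += 2.0 * abs(y - f) / denom
    return total / len(actual)


def mean_smape(actuals, forecasts):
    """Average the per-series sMAPE over a collection of series."""
    scores = [smape(a, f) for a, f in zip(actuals, forecasts)]
    return sum(scores) / len(scores)


# Hypothetical two-step horizon for a single series.
print(round(smape([100, 200], [110, 180]), 4))
```

Averaging per-series scores, as `mean_smape` does, is what makes the metric comparable across a database of many series with different scales.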
Associations of hospital discharge services with potentially avoidable readmissions within 30 days among older adults after rehabilitation in acute care hospitals in Tokyo, Japan (2020)
OBJECTIVE: To examine the associations of three major hospital discharge services covered under health insurance (discharge planning, rehabilitation discharge instruction, and coordination with community care) with potentially avoidable readmissions within 30 days (30-day PAR) in older adults after rehabilitation in acute care hospitals in Tokyo, Japan.
DESIGN: Retrospective cohort study using a large-scale medical claims database of all Tokyo residents aged ≥75 years.
SETTING: Acute care hospitals.
PARTICIPANTS: Patients who underwent rehabilitation and were discharged to home (n=31,247; mean age: 84.1 years, standard deviation: 5.7 years) between October 2013 and July 2014.
MAIN OUTCOME MEASURE: 30-day PAR.
RESULTS: Among the patients, 883 (2.9%) experienced 30-day PAR. A multivariable logistic generalized estimating equation model (with a logit link function and binomial sampling distribution) that adjusted for patient characteristics and clustering within hospitals showed that the discharge services were not significantly associated with 30-day PAR. The odds ratios were 0.962 (95% confidence interval [CI]: 0.805-1.151) for discharge planning, 1.060 (95% CI: 0.916-1.227) for rehabilitation discharge instruction, and 1.118 (95% CI: 0.817-1.529) for coordination with community care. In contrast, the odds of 30-day PAR among patients with home medical care services were 1.431 times higher than those of patients without these services (P<0.001), and the odds of 30-day PAR among patients with a higher number (median or higher) of rehabilitation units were 2.031 times higher than those of patients with a lower number (below median) (P<0.001). Also, the odds of 30-day PAR among patients with a higher hospital frailty risk score (median or higher) were 1.252 times higher than those of patients with a lower score (below median) (P=0.001).
CONCLUSIONS: The insurance-covered discharge services were not associated with 30-day PAR, and the development of comprehensive transitional care programs through the integration of existing discharge services may help to reduce such readmissions.
Copyright © 2020. Published by Elsevier Inc.
KEYWORDS: Big data; health services for the aged; patient readmission; rehabilitation; transitional care
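The reported odds ratios and confidence intervals can be read with a short check. The sketch assumes the intervals are Wald-type, that is, symmetric on the log-odds scale with z = 1.96; the abstract does not state this. Under that assumption the odds ratio should sit at the geometric mean of its limits, and an interval that contains 1 corresponds to a non-significant association at the 5% level. The function names are hypothetical; the numbers are those given in RESULTS.

```python
import math


def log_scale_se(lower, upper, z=1.96):
    """Standard error of the log odds ratio implied by a Wald-type 95% CI (assumed)."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def ci_excludes_one(lower, upper):
    """True when the interval does not contain an odds ratio of 1."""
    return lower > 1.0 or upper < 1.0


# (odds ratio, CI lower, CI upper) as reported in the abstract.
reported = {
    "discharge planning": (0.962, 0.805, 1.151),
    "rehabilitation discharge instruction": (1.060, 0.916, 1.227),
    "coordination with community care": (1.118, 0.817, 1.529),
}

for name, (odds_ratio, lower, upper) in reported.items():
    midpoint = math.sqrt(lower * upper)  # geometric mean of the CI limits
    print(
        f"{name}: OR={odds_ratio}, CI midpoint={midpoint:.3f}, "
        f"SE(log OR)={log_scale_se(lower, upper):.3f}, "
        f"excludes 1: {ci_excludes_one(lower, upper)}"
    )
```

All three intervals contain 1, which matches the abstract's statement that the discharge services were not significantly associated with 30-day PAR.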
Parallel hierarchical architectures for efficient consensus clustering on big multimedia cluster ensembles (2020)
Consensus clustering is a useful tool for robust or distributed clustering applications. However, given the fact that time complexities of the consensus functions scale linearly or quadratically with the number of combined clusterings, execution can be slow or even impossible when operating on big cluster ensembles, a situation encountered when we pursue robust multimedia data clustering. This work introduces hierarchical consensus architectures, an inherently parallel approach based on the divide-and-conquer strategy for computationally efficient consensus clustering, in a bid to make faster, more effective consensus clustering possible in big multimedia cluster ensemble scenarios. Moreover, we define a specific implementation of hierarchical architectures, including a theoretical analysis of its fully parallel implementation computational complexity. In experiments conducted on unimodal and multimedia data sets involving small and big cluster ensembles, we find parallel hierarchical consensus architectures variants perform faster than traditional flat consensus in 75% of the experiments on small cluster ensembles, a percentage that rises to 100% on unimodal and multimedia big cluster ensembles, achieving an average speedup ratio of 30.5. Moreover, depending on the consensus function employed, the quality of the obtained consensus partitions ensures robust clustering results.
Keywords: Consensus clustering | Big cluster ensembles | Multimedia clustering | Parallelization | Divide-and-conquer
Spatially varying impacts of built environment factors on rail transit ridership at station level: A case study in Guangzhou, China (2020)
Understanding the relationship between rail transit ridership and the built environment is crucial to promoting transit-oriented development and sustainable urban growth. Geographically weighted regression (GWR) models have previously been employed to reveal the spatial differences in such relationships at the station level. However, few studies have characterized the built environment at a fine scale and associated it with rail transit usage. Moreover, none of the existing studies attempted to categorize the stations for policy-making considering the varying impacts of the built environment. In this study, taking Guangzhou as an example, we integrated multisource spatial big data, such as high spatial resolution remote sensing images, points of interest (POIs), social media and building footprint data, to precisely quantify the characteristics of the built environment. This was combined with a GWR model to understand how the impacts of the fine-scale built environment factors on rail transit ridership vary across the study region. The k-means clustering method was employed to identify distinct station groups based on the coefficients of the GWR model at the local stations. Policy zoning was proposed based on the results, and differentiated planning guidance was suggested for different zones. These recommendations are expected to help increase rail transit usage, inform rail transit planning (to relieve the traffic burden on currently crowded lines), and re-allocate industrial and living facilities to reduce the commute for residents. The policy and planning implications are crucial for the coordinated development of the rail transit system and land use.
Keywords: Transit ridership | Built environment | Geographically weighted regression | K-means | Guangzhou
Privacy-preserving clustering for big data in cyber-physical-social systems: Survey and perspectives (2020)
Clustering plays a critical role in data mining and has been very successful in solving application problems such as community analysis, image retrieval, personalized recommendation, and activity prediction. This paper first reviews the traditional clustering methods and the emerging multiple clustering methods. Although the existing methods have superior performance on some small or specific datasets, they fall short when clustering is performed on CPSS big data because of the high cost of computation and storage. Powerful cloud computing can effectively address this challenge, but it poses an enormous threat to the privacy of individuals and companies. Currently, privacy-preserving data mining has attracted widespread attention in academia. Compared to other reviews, this paper focuses on privacy-preserving clustering techniques, providing a detailed overview and discussion. Specifically, we introduce a novel privacy-preserving tensor-based multiple clustering, propose a privacy-preserving tensor-based multiple clustering analytic and service framework, and give an illustrated case study on a public transportation dataset. Furthermore, we indicate the remaining challenges of privacy-preserving clustering and discuss significant future research in this area.
Keywords: CPSS | Big data | Cloud computing | Privacy preserving | Clustering