A dynamic classification unit for online segmentation of big data via small data buffers
Dynamic classification unit for online segmentation of big data via small data buffers - 2020
In many segmentation processes, we assign new cases according to a model that was built on the basis of past cases. As long as the new cases are “similar enough” to the past cases, segmentation proceeds normally. However, when a new case is substantially different from the known cases, a reexamination of the previously created segments is required. The reexamination may result in the creation of new segments or in the updating of the existing ones. In this paper, we assume that in big and dynamic data environments it is not possible to reexamine all past data and, therefore, we suggest using small groups of selected cases, stored in small data buffers, as an alternative to the collection of all past data. We present an incremental dynamic classifier that supports real-time unsupervised segmentation in big and dynamic data environments. In order to reduce the computational effort of unsupervised clustering in such environments, the suggested model performs calculations only on the relevant data buffers that store the relevant representative cases. In addition, the suggested model can serve as a dynamic classification unit (DCU) that can act as an autonomous agent, as well as collaborate with other DCUs. The evaluation is presented by comparing three approaches: static, dynamic, and incremental dynamic.
Keywords: Incremental dynamic classifier | Dynamic segmentation | Incremental data analysis | Cluster analysis | Classification | Big data
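The buffer idea in the abstract above can be illustrated with a minimal sketch. This is not the paper's algorithm: the class name `BufferedSegmenter`, the distance threshold, and the rule "open a new segment when the nearest centroid is too far away" are all assumptions made for illustration. Each segment keeps only a small bounded buffer of recent cases, and centroids are computed from those buffers rather than from all past data.

```python
import math
from collections import deque


class BufferedSegmenter:
    """Hypothetical incremental segmenter that stores only small buffers."""

    def __init__(self, threshold, buffer_size=5):
        self.threshold = threshold      # assumed "similar enough" distance
        self.buffer_size = buffer_size  # max cases kept per segment
        self.buffers = []               # one bounded deque per segment

    @staticmethod
    def _centroid(buf):
        dims = len(buf[0])
        return [sum(case[d] for case in buf) / len(buf) for d in range(dims)]

    def assign(self, case):
        """Return the segment index for `case`, creating a segment if needed."""
        best, best_dist = None, math.inf
        for idx, buf in enumerate(self.buffers):
            dist = math.dist(case, self._centroid(buf))
            if dist < best_dist:
                best, best_dist = idx, dist
        if best is None or best_dist > self.threshold:
            # The case is substantially different: open a new segment.
            self.buffers.append(deque([case], maxlen=self.buffer_size))
            return len(self.buffers) - 1
        # Similar enough: store the case; the oldest one drops out when full.
        self.buffers[best].append(case)
        return best


segmenter = BufferedSegmenter(threshold=2.0, buffer_size=3)
labels = [segmenter.assign(c) for c in [(0, 0), (0.5, 0.5), (10, 10), (9.5, 10)]]
```

Because every distance is computed against buffer centroids only, the cost of assigning a case depends on the number of segments and the buffer size, not on the volume of past data.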
Quantile regression in big data: A divide and conquer based strategy
Quantile regression in big data: A divide-and-conquer-based strategy - 2020
Quantile regression, which analyzes the conditional distribution of outcomes given a set of covariates, has been widely used in many fields. However, the volume and velocity of big data make the estimation of quantile regression models extremely difficult due to the intensive computation and the limited storage. Based on a divide-and-conquer strategy, a simple and efficient method is proposed to address this problem. The proposed approach keeps only summary statistics of each data block and can then use them to reconstruct the estimator of the entire data with asymptotically negligible approximation error. This property makes the proposed method particularly appealing when data blocks are retained on multiple servers or arrive in the form of a data stream. Furthermore, the proposed estimator is shown to be consistent and asymptotically as efficient as the estimating equation estimator calculated using the entire data together when certain conditions hold. The merits of the proposed method are illustrated using both simulation studies and real data analysis.
Keywords: Data stream | Divide and conquer | Estimating equation | Massive data sets | Quantile regression
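The "keep only summary statistics per block, then recombine" pattern can be sketched as follows. Note the swap: this example uses ordinary least squares for a single covariate, not the paper's quantile regression estimator, because for least squares the block sums reconstruct the full-data fit exactly. The function names are hypothetical.

```python
def block_summary(xs, ys):
    """Summary statistics of one data block: n, sum x, sum y, sum x^2, sum xy."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(x * y for x, y in zip(xs, ys)),
    )


def combine(summaries):
    """Add the block summaries element-wise; raw data is no longer needed."""
    return tuple(sum(parts) for parts in zip(*summaries))


def fit_line(summary):
    """Least-squares intercept and slope from the combined sums."""
    n, sx, sy, sxx, sxy = summary
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Illustrative data: y = 2x + 1, split into three blocks of unequal size.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
blocks = [(xs[:3], ys[:3]), (xs[3:7], ys[3:7]), (xs[7:], ys[7:])]
intercept, slope = fit_line(combine(block_summary(bx, by) for bx, by in blocks))
```

Each block can live on a different server or arrive later in a stream; only the five numbers per block have to be stored or transmitted.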
Bias reduction in the population size estimation of large data sets
Bias reduction in the population size estimation of large data sets - 2020
Estimation of the population size of large data sets and hard to reach populations can be a significant problem. For example, in the military, manpower is limited and the manual processing of large data sets can be time consuming. In addition, accessing the full population of data may be restricted by factors such as cost, time, and safety. Four new population size estimators are proposed, as extensions of existing methods, and their performances are compared in terms of bias with two existing methods in the big data literature. These would be particularly beneficial in the context of time-critical decisions or actions. The comparison is based on a simulation study and the application to five real network data sets (Twitter, LiveJournal, Pokec, Youtube, Wikipedia Talk). Whilst no single estimator (out of the four proposed) generates the most accurate estimates overall, the proposed estimators are shown to produce more accurate population size estimates for small sample sizes, but in some cases show more variability than existing estimators in the literature.
Keywords: Relative bias | Twitter | Size estimator | Youtube | Random walk sampling
Forecasting third-party mobile payments with implications for customer flow prediction
Forecasting third-party mobile payments with implications for customer flow prediction - 2020
Forecasting customer flow is key for retailers in making daily operational decisions, but small retailers often lack the resources to obtain such forecasts. Rather than forecasting stores’ total customer flows, this research utilizes emerging third-party mobile payment data to provide participating stores with a value-added service by forecasting their share of daily customer flows. These customer transactions using mobile payments can then be utilized further to derive retailers’ total customer flows indirectly, thereby overcoming the constraints that small retailers face. We propose a daily mobile payments forecasting solution, centered on a third-party mobile payment platform, based on an extension of the newly developed Gradient Boosting Regression Tree (GBRT) method, which can generate multi-step forecasts for many stores concurrently. Using empirical forecasting experiments with thousands of time series, we show that GBRT, together with a strategy for multi-period-ahead forecasting, provides more accurate forecasts than established benchmarks. Pooling data from the platform across stores leads to benefits relative to analyzing the data individually, thus demonstrating the value of this machine learning application.
Keywords: Analytics | Big data | Customer flow forecasting | Machine learning | Forecasting many time series | Multi-step-ahead forecasting strategy
MISS-D: A fast and scalable framework of medical image storage service based on distributed file system
MISS-D: A fast and scalable framework of medical image storage service based on distributed file system - 2020
Background and Objective: Processing medical imaging big data is deeply challenging due to the size of the data, computational complexity, secure storage and inherent privacy issues. Traditional picture archiving and communication systems, an imaging technology used in the healthcare industry, generally rely on centralized high-performance disk storage arrays in practice. The existing storage solutions are not suitable for the diverse range of medical imaging big data that needs to be stored reliably and accessed in a timely manner. Cloud computing is emerging as an economical solution that provides scalability, elasticity, performance and better cost management. Cloud-based storage architecture for medical imaging big data has attracted increasing attention in industry and academia. Methods: This study presents a novel, fast and scalable framework for a medical image storage service based on a distributed file system. Two innovations of the framework are introduced in this paper. An integrated medical imaging content indexing file model for large-scale image sequences is designed to achieve high storage efficiency on a distributed file system. A virtual file pooling technology is proposed, which uses the memory-mapped file method to achieve an efficient data reading process and provides a data swapping strategy in the pool. Results: The experiments show that the framework not only has comparable file reading and writing performance, which meets the requirements of real-time application domains, but also brings greater convenience for clinical system developers through multiple client access types. The framework supports different user client types through unified micro-service interfaces, which basically meet the needs of clinical system development, especially for online applications. The experimental results demonstrate that the framework can meet the needs of real-time data access as well as a traditional picture archiving and communication system. Conclusions: This framework aims to allow rapid data access for massive medical images, as demonstrated by the online web client for the MISS-D framework implemented in this paper for real-time data interaction. The framework also provides a substantial subset of the features of existing open-source and commercial alternatives and has a wide range of potential applications.
Keywords: Hadoop distributed file system | Data packing | Memory mapping file | Message queue | Micro-service | Medical imaging
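The two ideas named in the Methods part above, packing an image sequence into one indexed file and reading it through a memory-mapped file, can be sketched with the Python standard library. This is only an illustration under assumptions: the helper names `pack_slices` and `read_slice` and the in-memory `(offset, length)` index are hypothetical and do not reproduce the MISS-D file model or its pooling and swapping strategy.

```python
import mmap
import os
import tempfile


def pack_slices(path, slices):
    """Write all image slices into one file; return an (offset, length) index."""
    index = []
    offset = 0
    with open(path, "wb") as fh:
        for payload in slices:
            fh.write(payload)
            index.append((offset, len(payload)))
            offset += len(payload)
    return index


def read_slice(path, index, i):
    """Read slice i through a read-only memory map of the packed file."""
    offset, length = index[i]
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; this fails on an empty file.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return bytes(mapped[offset:offset + length])


# Illustrative payloads standing in for encoded image slices.
with tempfile.TemporaryDirectory() as tmp:
    packed = os.path.join(tmp, "series.bin")
    idx = pack_slices(packed, [b"slice-0", b"slice-one", b"s2"])
    second = read_slice(packed, idx, 1)
```

Packing many small slices into one file avoids per-file overhead on a distributed file system, and the memory map lets the operating system page in only the bytes that are actually read.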
City limits in the age of smartphones and urban scaling
City limits in the age of smartphones and urban scaling - 2020
Urban planning still lacks appropriate standards to define city boundaries across urban systems. This issue has historically been left to administrative criteria, which can vary significantly across countries and political systems, hindering comparative analysis across urban systems. However, the wide use of Information and Communication Technologies (ICT) has now allowed the development of new quantitative approaches to unveil how social dynamics relate to urban infrastructure. In fact, ICT provide the potential to portray more accurate descriptions of urban systems based on the empirical analysis of millions of traces left by urbanites across the city. In this work, we apply computational techniques to a large volume of mobile phone records to define urban boundaries, through the analysis of travel patterns and the trajectories of urban dwellers in conurbations with more than 100,000 inhabitants in Chile. We created and analyzed the network of interconnected places inferred from individual travel trajectories. We then ranked each place using a spectral centrality method. This allowed us to identify places of higher concurrency and functional importance for each urban environment. Urban scaling analysis is finally used as a diagnostic tool that allowed us to distinguish urban from non-urban spaces. The geographic assessment of our method shows a high congruence with the current administrative definitions of urban agglomerations in Chile. Our results can potentially be considered as a functional definition of the urban boundary. They also provide a practical implementation of urban scaling and data-driven approaches to cities as complex systems using increasingly larger non-conventional datasets.
Keywords: City boundaries definition | Spectral network analysis | Urban informatics | Social computing | Scaling laws | Complex systems | Big data
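The abstract above does not say which spectral centrality method was used to rank places. As one possible illustration, the sketch below computes eigenvector centrality by power iteration on a small weighted network of places. The choice of eigenvector centrality, the function name, and the toy network are all assumptions.

```python
def eigenvector_centrality(adjacency, iterations=200, tol=1e-10):
    """Power iteration on (A + I); `adjacency` maps node -> {neighbour: weight}.

    Every node must appear as a key, even if it has no outgoing links.
    Adding the identity keeps the same leading eigenvector but avoids the
    oscillation that plain power iteration shows on bipartite networks.
    """
    score = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        new = dict(score)  # the identity term
        for src, neighbours in adjacency.items():
            for dst, weight in neighbours.items():
                new[dst] += weight * score[src]
        norm = max(new.values())
        new = {node: value / norm for node, value in new.items()}
        if max(abs(new[n] - score[n]) for n in adjacency) < tol:
            return new
        score = new
    return score


# Toy network of places: a triangle A-B-C with a peripheral place D linked to C.
trips = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
ranking = eigenvector_centrality(trips)
```

In this toy example the best-connected place, C, receives the highest score and the peripheral place, D, the lowest, which is the kind of ordering a centrality-based ranking of places is meant to expose.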
Mobile phone network data reveal nationwide economic value of coastal tourism under climate change
Mobile phone network data reveal nationwide economic value of coastal tourism under climate change - 2020
The technology-driven application of big data is expected to assist policymaking towards sustainable development; however, the relevant literature has not addressed human welfare under climate change, which limits the understanding of climate change impacts on human societies. We present the first application of unique mobile phone network data to evaluate the current nation-wide human welfare of coastal tourism at Japanese beaches and project the value change using the four climate change scenarios. The results show that the projected national economic value loss rates are more significant than the projected national physical beach loss rates. Our findings demonstrate regional differences in recreational values: most southern beaches with larger current values would disappear, while the current small values of the northern beaches would remain. These changes imply that the ranks of the beaches, based on economic values, would enable policymakers to discuss management priorities under climate change.
Keywords: Adaptation | Beach recreation | Big data | Climate change | Coastal tourism | Ecosystem services | Travel cost method | Sea level rise
Special interest tourism is not so special after all: Big data evidence from the 2017 Great American Solar Eclipse
Special interest tourism is not so special after all: Big data evidence from the 2017 Great American Solar Eclipse - 2020
This study puts to empirical test a major typology in the tourism literature, mass versus special interest tourism (SIT), as the once-distinctive boundary between the two has become blurry in modern tourism scholarship. We utilize 41,747 geo-located Instagram photos pertaining to the 2017 Great American Solar Eclipse and Big Data analytics to distinguish tourists based on their choice of observational destinations and spatial movement patterns. Two types of tourists are identified: opportunists and hardcore. The motivational profile of those tourists is validated with external data through hypothesis testing and compared and contrasted with existing motivation-based tourist typologies. The main conclusion is that a large share of tourists involved in what is traditionally understood as SIT activities exhibit the behavior and profile characteristic of mass tourists, who seek novelty but are conscious of risks and comforts. Practical implications regarding the potential of rural and urban destinations for developing SIT are also discussed.
Keywords: Big data | Instagram photos | Social media | Spatial analysis | Special interest tourism | Astro-tourism
Intelligent condition assessment of industry machinery using multiple type of signal from monitoring system
Intelligent condition assessment of industry machinery using multiple type of signal from monitoring system - 2020
Real-time condition assessment of machinery is used to avoid catastrophic failures. A new strategy that combines data processing with a data-driven method is presented for condition assessment of machinery based on multiple characteristic parameters of industrial equipment. First, data processing is carried out, including industrial data cleaning, correlation analysis using the Bin method, and condition division. The vibration parameters, which are sensitive to state changes of the machine, are taken as the data binning reference. Second, a multi-parameter condition evaluation technique based on a Hidden Markov Model is proposed. Finally, the industrial big data collected from the monitoring system are analyzed and a site test is conducted. The results show that the proposed technique can not only evaluate the running condition of the machinery but also reflect changes in the operational condition. It shows potential for tracing further deterioration of the machine.
Keywords: Industrial machinery | Monitoring system | Condition assessment | Correlation analysis | Hidden Markov Model
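How a Hidden Markov Model can turn binned signals into a condition estimate can be sketched with Viterbi decoding. The abstract above does not describe the model's states, parameters, or inference step, so the two states, the "low"/"high" vibration bins, and all probabilities below are made-up values for illustration only.

```python
def viterbi(observations, states, start, trans, emit):
    """Return the most likely hidden state sequence for the observations."""
    prob = [{s: start[s] * emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        prev = prob[-1]
        step, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this time step.
            best_prev = max(states, key=lambda p: prev[p] * trans[p][s])
            step[s] = prev[best_prev] * trans[best_prev][s] * emit[s][obs]
            pointers[s] = best_prev
        prob.append(step)
        back.append(pointers)
    # Trace back from the most likely final state.
    path = [max(states, key=lambda s: prob[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical two-condition model driven by binned vibration levels.
states = ["normal", "degraded"]
start = {"normal": 0.9, "degraded": 0.1}
trans = {
    "normal": {"normal": 0.9, "degraded": 0.1},
    "degraded": {"normal": 0.1, "degraded": 0.9},
}
emit = {
    "normal": {"low": 0.9, "high": 0.1},
    "degraded": {"low": 0.2, "high": 0.8},
}
conditions = viterbi(["low", "low", "high", "high"], states, start, trans, emit)
```

Raw probabilities are multiplied here for readability; on long monitoring sequences they underflow, so a practical version would work with log-probabilities.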
Challenges and recommended technologies for the industrial internet of things: A comprehensive review
Challenges and recommended technologies for the industrial internet of things: A comprehensive review - 2020
Integrating the physical world with the cyber world opens the opportunity to create smart environments; this new paradigm is called the Internet of Things (IoT). Communication between humans and objects has been extended to communication between objects. Industrial IoT (IIoT) brings the benefits of IoT communications to business applications, focusing on interoperability between machines (i.e., IIoT is a subset of the IoT). The number of everyday things and objects connected to the Internet has been increasing, which makes the IoT a dynamic network of networks. Challenges such as heterogeneity, dynamicity, velocity, and volume of data make IoT services produce inconsistent, inaccurate, incomplete, and incorrect results, which are critical for many applications, especially in IIoT (e.g., health care, smart transportation, wearables, finance, industry, etc.). Discovering, searching, and sharing data and resources account for 40% of IoT benefits and cover almost all industrial applications. Enabling real-time data analysis, knowledge extraction, and search techniques based on Information and Communication Technologies (ICT), such as data fusion, machine learning, big data, cloud computing, blockchain, etc., can reduce and control IoT and leverage its value. This research presents a comprehensive review of state-of-the-art challenges and recommended technologies for enabling data analysis and search in the future IoT, and presents a framework for ICT integration in IoT layers. This paper surveys current IoT search engines (IoTSEs) and presents two case studies that reflect promising enhancements in the intelligence and smartness of IoT applications due to ICT integration.
Keywords: Industrial IoT (IIoT) | Searching and indexing | Blockchain | Big data | Data fusion | Machine learning | Cloud and fog computing