A dynamic classification unit for online segmentation of big data via small data buffers
واحد طبقه بندی پویا برای تقسیم آنلاین داده های بزرگ از طریق بافر داده های کوچک-2020
In many segmentation processes, we assign new cases according to a model that was built on the basis of past cases. As long as the new cases are “similar enough” to the past cases, segmentation proceeds normally. However, when a new case is substantially different from the known cases, a reexamination of the previously created segments is required. The reexamination may result in the creation of new segments or in the updating of the existing ones. In this paper, we assume that in big and dynamic data environments it is not possible to reexamine all past data and, therefore, we suggest using small groups of selected cases, stored in small data buffers, as an alternative to the collection of all past data. We present an incremental dynamic classifier that supports real-time unsupervised segmentation in big and dynamic data environments. In order to reduce the computational effort of unsupervised clustering in such environments, the suggested model performs calculations only on the relevant data buffers that store the relevant representative cases. In addition, the suggested model can serve as a dynamic classification unit (DCU) that can act as an autonomous agent, as well as collaborate with other DCUs. The evaluation is presented by comparing three approaches: static, dynamic, and incremental dynamic.
Keywords: Incremental dynamic classifier | Dynamic segmentation | Incremental data analysis | Cluster analysis | Classification | Big data
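The buffer idea in this abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the class name `BufferedSegmenter`, the distance threshold `radius`, and the centroid-based assignment rule are all assumptions made for illustration. Each segment keeps only a small, bounded buffer of representative cases, and a new case that is not "similar enough" to any buffer opens a new segment.

```python
import math
from collections import deque


class BufferedSegmenter:
    """Hypothetical sketch: each segment keeps a small buffer of recent cases."""

    def __init__(self, radius, buffer_size=5):
        self.radius = radius            # assumed "similar enough" threshold
        self.buffer_size = buffer_size  # maximum number of cases kept per segment
        self.buffers = []               # one bounded deque of cases per segment

    @staticmethod
    def _centroid(buf):
        dim = len(buf[0])
        return tuple(sum(p[i] for p in buf) / len(buf) for i in range(dim))

    def assign(self, case):
        """Return the segment index for `case`, opening a new segment if needed."""
        best, best_dist = None, math.inf
        for idx, buf in enumerate(self.buffers):
            d = math.dist(case, self._centroid(buf))
            if d < best_dist:
                best, best_dist = idx, d
        if best is None or best_dist > self.radius:
            # Too different from every known segment: start a new one.
            self.buffers.append(deque([case], maxlen=self.buffer_size))
            return len(self.buffers) - 1
        # Similar enough: update only the relevant buffer (oldest case drops out).
        self.buffers[best].append(case)
        return best


seg = BufferedSegmenter(radius=2.0, buffer_size=3)
cases = [(0, 0), (0.5, 0.2), (10, 10), (9.5, 10.4), (0.1, 0.3)]  # made-up data
labels = [seg.assign(c) for c in cases]
```

Only the nearest buffer is touched per case, so the cost of an assignment depends on the number of segments and the buffer size rather than on the volume of past data. Reexamining or merging existing segments, which the abstract also mentions, is left out of this sketch.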
Quantile regression in big data: A divide and conquer based strategy
رگرسیون کمی در داده های بزرگ: یک استراتژی مبتنی بر تقسیم و غلبه-2020
Quantile regression, which analyzes the conditional distribution of outcomes given a set of covariates, has been widely used in many fields. However, the volume and velocity of big data make the estimation of quantile regression models extremely difficult because of the intensive computation and limited storage involved. Based on a divide-and-conquer strategy, a simple and efficient method is proposed to address this problem. The proposed approach keeps only summary statistics of each data block and can then use them to reconstruct the estimator of the entire data set with asymptotically negligible approximation error. This property makes the proposed method particularly appealing when data blocks are retained on multiple servers or arrive in the form of a data stream. Furthermore, the proposed estimator is shown to be consistent and asymptotically as efficient as the estimating equation estimator calculated using the entire data set when certain conditions hold. The merits of the proposed method are illustrated using both simulation studies and real data analysis.
Keywords: Data stream | Divide and conquer | Estimating equation | Massive data sets | Quantile regression
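For orientation, the standard quantile regression estimator and one generic way of combining block-level results can be written as below. The check loss and the minimization problem are the textbook definitions. The weighted-average combination is the generic aggregated-estimating-equation form and is shown only as an assumption about what block-level summary statistics might look like; the abstract does not state the paper's exact estimator. The symbols $K$ (number of blocks), $\hat{\beta}_k(\tau)$ (block estimate), and $A_k$ (block weight matrix) are introduced here for illustration.

```latex
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr), \qquad
\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\bigl(y_i - x_i^{\top}\beta\bigr)

\hat{\beta}_{\mathrm{DC}}(\tau) = \Bigl(\sum_{k=1}^{K} A_k\Bigr)^{-1} \sum_{k=1}^{K} A_k\,\hat{\beta}_k(\tau)
```

Under this form, each block only needs to ship the pair $(A_k, A_k\hat{\beta}_k(\tau))$, so the combined estimate can be updated as new blocks arrive without revisiting earlier data.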
Bias reduction in the population size estimation of large data sets
کاهش تمایل در برآورد اندازه جمعیت مجموعه داده های بزرگ-2020
Estimation of the population size of large data sets and hard-to-reach populations can be a significant problem. For example, in the military, manpower is limited and the manual processing of large data sets can be time-consuming. In addition, access to the full population of data may be restricted by factors such as cost, time, and safety. Four new population size estimators are proposed as extensions of existing methods, and their performance is compared, in terms of bias, with two existing methods from the big data literature. The proposed estimators would be particularly beneficial in the context of time-critical decisions or actions. The comparison is based on a simulation study and an application to five real network data sets (Twitter, LiveJournal, Pokec, Youtube, Wikipedia Talk). Whilst no single estimator (out of the four proposed) generates the most accurate estimates overall, the proposed estimators are shown to produce more accurate population size estimates for small sample sizes, although in some cases they show more variability than existing estimators in the literature.
Keywords: Relative bias | Twitter | Size estimator | Youtube | Random walk sampling
Democratization of AI, Albeit Constrained IoT Devices & Tiny ML, for Creating a Sustainable Food Future
دموکراتیک سازی هوش مصنوعی ، دستگاه های محدود IoT و Tiny ML ، برای ایجاد آینده غذایی پایدار-2020
Big Data surrounds us. Every minute, our smartphones collect huge amounts of data, from geolocations to the next clickable item on an e-commerce site. Data has become one of the most important commodities for individuals and companies. Nevertheless, this data revolution has not touched every economic sector, especially rural economies; for example, small farmers in developing countries have largely been passed over by the data revolution because of infrastructure- and compute-constrained environments. Not only is this a huge missed opportunity for big data companies, it is also one of the significant obstacles on the path towards sustainable food and a huge inhibitor to closing economic disparities. The purpose of the paper is to develop a framework for deploying artificial intelligence models in constrained compute environments, enabling remote rural areas and small farmers to join the data revolution, start contributing to the digital economy, and empower the world through data to create sustainable food for our collective future.
Keywords: edge | IoT device | artificial intelligence | Kalman filter | dairy cloud | small scale farmers | hardware constrained model | tiny ML | Hanumayamma | cow necklace
Forecasting third-party mobile payments with implications for customer flow prediction
پیش بینی پرداخت های تلفن همراه شخص ثالث با پیامدهای پیش بینی جریان مشتری-2020
Forecasting customer flow is key for retailers in making daily operational decisions, but small retailers often lack the resources to obtain such forecasts. Rather than forecasting stores' total customer flows, this research uses emerging third-party mobile payment data to provide participating stores with a value-added service by forecasting their share of daily customer flows. These mobile payment transactions can then be used to derive retailers' total customer flows indirectly, thereby overcoming the constraints that small retailers face. We propose a daily mobile payments forecasting solution, centered on the third-party mobile payment platform, that is based on an extension of the newly developed Gradient Boosting Regression Tree (GBRT) method and can generate multi-step forecasts for many stores concurrently. Using empirical forecasting experiments with thousands of time series, we show that GBRT, together with a strategy for multi-period-ahead forecasting, provides more accurate forecasts than established benchmarks. Pooling data from the platform across stores leads to benefits relative to analyzing each store's data individually, thus demonstrating the value of this machine learning application.
Keywords: Analytics | Big data | Customer flow forecasting | Machine learning | Forecasting many time series | Multi-step-ahead forecasting strategy
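The abstract mentions "a strategy for multi-period-ahead forecasting" without naming it. One common option is the direct strategy, in which a separate regressor is trained for each horizon on lagged values. The sketch below shows only the data preparation step for that strategy; the function name `make_direct_dataset`, the lag count, and the toy series are assumptions for illustration and are not taken from the paper.

```python
def make_direct_dataset(series, n_lags, horizon):
    """Build (X, y) pairs for a direct `horizon`-step-ahead forecaster.

    Each row of X holds `n_lags` consecutive observations; the matching
    entry of y is the value `horizon` steps after the last lag.
    """
    X, y = [], []
    last_start = len(series) - n_lags - horizon
    for start in range(last_start + 1):
        window = series[start:start + n_lags]
        target = series[start + n_lags + horizon - 1]
        X.append(list(window))
        y.append(target)
    return X, y


daily_payments = [10, 12, 11, 15, 14, 18, 17]  # made-up toy series for one store
X, y = make_direct_dataset(daily_payments, n_lags=3, horizon=2)
# X is [[10, 12, 11], [12, 11, 15], [11, 15, 14]] and y is [14, 18, 17]
```

Pooling across stores, as described in the abstract, would correspond to concatenating the rows produced for each store before fitting a single regressor such as a gradient boosting model for that horizon.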
MISS-D: A fast and scalable framework of medical image storage service based on distributed file system
MISS-D: یک چارچوب سریع و مقیاس پذیر از خدمات ذخیره سازی تصویر پزشکی بر اساس سیستم فایل توزیع شده-2020
Background and Objective: Processing medical imaging big data is deeply challenging because of the size of the data, computational complexity, storage security, and inherent privacy issues. The traditional picture archiving and communication system, an imaging technology used in the healthcare industry, generally relies on centralized high-performance disk storage arrays in practice. Existing storage solutions are not suitable for the diverse range of medical imaging big data that needs to be stored reliably and accessed in a timely manner. Cloud computing is emerging as an economical solution that provides scalability, elasticity, performance, and better cost management. Cloud-based storage architecture for medical imaging big data has attracted increasing attention in industry and academia. Methods: This study presents a novel, fast, and scalable framework for a medical image storage service based on a distributed file system. Two innovations of the framework are introduced. First, an integrated medical imaging content indexing file model for large-scale image sequences is designed to achieve high storage efficiency on the distributed file system. Second, a virtual file pooling technology is proposed, which uses memory-mapped files to achieve an efficient data reading process and provides a data swapping strategy within the pool. Results: The experiments show that the framework not only offers comparable file reading and writing performance, which meets the requirements of real-time application domains, but also brings greater convenience to clinical system developers through multiple client access types. The framework supports different user client types through unified micro-service interfaces, which largely meet the needs of clinical system development, especially for online applications. The experimental results demonstrate that the framework can meet the needs of real-time data access as well as a traditional picture archiving and communication system. Conclusions: This framework aims to allow rapid data access for massive medical images, as demonstrated by the online web client for the MISS-D framework implemented in this paper for real-time data interaction. The framework also provides a substantial subset of the features of existing open-source and commercial alternatives and has a wide range of potential applications.
Keywords: Hadoop distributed file system | Data packing | Memory mapping file | Message queue | Micro-service | Medical imaging
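The combination of data packing and memory-mapped reading named in the abstract and keywords can be sketched with Python's standard `mmap` module. This is a simplified local-file illustration, not the MISS-D file model: the helper names `pack_images` and `read_image`, the in-memory `(offset, length)` index, and the file name are assumptions, and the pool's data swapping strategy is not modeled.

```python
import mmap
import os
import tempfile


def pack_images(path, blobs):
    """Write blobs back to back and return an index of (offset, length) pairs."""
    index, offset = [], 0
    with open(path, "wb") as f:
        for blob in blobs:
            f.write(blob)
            index.append((offset, len(blob)))
            offset += len(blob)
    return index


def read_image(path, index, i):
    """Read the i-th blob through a read-only memory map of the packed file."""
    offset, length = index[i]
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


with tempfile.TemporaryDirectory() as tmp:
    packed = os.path.join(tmp, "series.pack")  # hypothetical file name
    idx = pack_images(packed, [b"slice-0", b"slice-01", b"slice-002"])
    second = read_image(packed, idx, 1)
```

Packing many small images into one file reduces the number of files the storage layer has to track, and the memory map lets the operating system page in only the byte range that is actually read.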
AI Down on the Farm
هوش مصنوعی کوچک در مزرعه-2020
Agriculture has become an information-intensive industry. In the production of crops and animals, precision agriculture approaches have resulted in the collection of spatially and temporally dense datasets by farmers and agricultural researchers. These big datasets, often characterized by extensive nonlinearities and interactions, are often best analyzed using machine learning (ML) or other artificial intelligence (AI) approaches. In this article, we review several case studies where ML has been used to model aspects of agricultural production systems and provide information useful for farm-level management decisions. These studies include modeling animal feeding behavior as a predictor of stress or disease, providing information important for developing precise and efficient irrigation systems, and enhancing tools used to recommend optimum levels of nitrogen fertilization for corn. Taken together, these examples represent the current abilities and future potential for AI applications in agricultural production systems.
City limits in the age of smartphones and urban scaling
محدودیت های شهر در عصر تلفن های هوشمند و مقیاس بندی شهری-2020
Urban planning still lacks appropriate standards to define city boundaries across urban systems. This issue has historically been left to administrative criteria, which can vary significantly across countries and political systems, hindering comparative analysis across urban systems. However, the wide use of Information and Communication Technologies (ICT) has now allowed the development of new quantitative approaches to unveil how social dynamics relate to urban infrastructure. In fact, ICT provide the potential to portray more accurate descriptions of urban systems based on the empirical analysis of millions of traces left by urbanites across the city. In this work, we apply computational techniques to a large volume of mobile phone records to define urban boundaries, through the analysis of travel patterns and the trajectories of urban dwellers in conurbations with more than 100,000 inhabitants in Chile. We created and analyzed the network of interconnected places inferred from individual travel trajectories. We then ranked each place using a spectral centrality method. This allowed us to identify places of higher concurrency and functional importance in each urban environment. Finally, urban scaling analysis is used as a diagnostic tool that allowed us to distinguish urban from non-urban spaces. The geographic assessment of our method shows a high congruence with the current administrative definitions of urban agglomerations in Chile. Our results can potentially be considered a functional definition of the urban boundary. They also provide a practical implementation of urban scaling and data-driven approaches to cities as complex systems using increasingly large non-conventional datasets.
Keywords: City boundaries definition | Spectral network analysis | Urban informatics | Social computing | Scaling laws | Complex systems | Big data
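The abstract does not say which spectral centrality method is used to rank places. As one possible illustration, the sketch below computes eigenvector centrality by power iteration on a small weighted network of places. The choice of eigenvector centrality, the function name, and the place names and trip counts are all assumptions made for this example.

```python
def eigenvector_centrality(adj, iterations=100, tol=1e-9):
    """Power iteration on a symmetric adjacency dict {node: {neighbor: weight}}.

    Scores are normalized to sum to 1.
    """
    nodes = list(adj)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Adding the old score shifts the matrix by the identity, which keeps
        # the same leading eigenvector while damping oscillation.
        new = {
            n: score[n] + sum(w * score[m] for m, w in adj[n].items())
            for n in nodes
        }
        total = sum(new.values())
        new = {n: v / total for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in nodes) < tol:
            return new
        score = new
    return score


# Made-up trip counts between places (symmetric weights).
trips = {
    "plaza": {"station": 3, "market": 2, "suburb": 1},
    "station": {"plaza": 3, "market": 1},
    "market": {"plaza": 2, "station": 1},
    "suburb": {"plaza": 1},
}
scores = eigenvector_centrality(trips)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy network the most connected place ranks first and the place with a single weak link ranks last, which is the kind of ordering by functional importance that the abstract describes.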
Mobile phone network data reveal nationwide economic value of coastal tourism under climate change
ارزش اقتصادی داده های شبکه تلفن همراه در سراسر جهان از گردشگری ساحلی در اثر تغییر آب و هوا-2020
The technology-driven application of big data is expected to assist policymaking towards sustainable development; however, the relevant literature has not addressed human welfare under climate change, which limits the understanding of climate change impacts on human societies. We present the first application of unique mobile phone network data to evaluate the current nationwide human welfare of coastal tourism at Japanese beaches and to project the change in value under four climate change scenarios. The results show that the projected national economic value loss rates are more significant than the projected national physical beach loss rates. Our findings demonstrate regional differences in recreational values: most southern beaches with larger current values would disappear, while the current small values of the northern beaches would remain. These changes imply that the ranks of the beaches, based on economic values, would enable policymakers to discuss management priorities under climate change.
Keywords: Adaptation | Beach recreation | Big data | Climate change | Coastal tourism | Ecosystem services | Travel cost method | Sea level rise
Special interest tourism is not so special after all: Big data evidence from the 2017 Great American Solar Eclipse
جهانگردی با علاقه ویژه از همه مهم تر نیست: شواهد داده های بزرگ از خورشید گرفتگی بزرگ آمریکایی 2017-2020
This study puts to empirical test a major typology in the tourism literature, mass versus special interest tourism (SIT), as the once-distinctive boundary between the two has become blurry in modern tourism scholarship. We use 41,747 geo-located Instagram photos pertaining to the 2017 Great American Solar Eclipse, together with Big Data analytics, to distinguish tourists based on their choice of observational destinations and spatial movement patterns. Two types of tourists are identified: opportunists and hardcore. The motivational profile of those tourists is validated against external data through hypothesis testing and compared and contrasted with existing motivation-based tourist typologies. The main conclusion is that a large share of tourists involved in what is traditionally understood as SIT activities exhibit the behavior and profile characteristic of mass tourists, who seek novelty but are conscious of risks and comforts. Practical implications regarding the potential of rural and urban destinations for developing SIT are also discussed.
Keywords: Big data | Instagram photos | Social media | Spatial analysis | Special interest tourism | Astro-tourism