A dynamic classification unit for online segmentation of big data via small data buffers
واحد طبقه بندی پویا برای تقسیم آنلاین داده های بزرگ از طریق بافر داده های کوچک-2020
In many segmentation processes, we assign new cases according to a model that was built on the basis of past cases. As long as the new cases are “similar enough” to the past cases, segmentation proceeds normally. However, when a new case is substantially different from the known cases, a reexamination of the previously created segments is required. The reexamination may result in the creation of new segments or in the updating of the existing ones. In this paper, we assume that in big and dynamic data environments it is not possible to reexamine all past data and, therefore, we suggest using small groups of selected cases, stored in small data buffers, as an alternative to the collection of all past data. We present an incremental dynamic classifier that supports real-time unsupervised segmentation in big and dynamic data environments. In order to reduce the computational effort of unsupervised clustering in such environments, the suggested model performs calculations only on the relevant data buffers that store the relevant representative cases. In addition, the suggested model can serve as a dynamic classification unit (DCU) that can act as an autonomous agent, as well as collaborate with other DCUs. The evaluation is presented by comparing three approaches: static, dynamic, and incremental dynamic.
Keywords: Incremental dynamic classifier | Dynamic segmentation | Incremental data analysis | Cluster analysis | Classification | Big data
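As a rough illustration of the small-buffer idea described in the abstract above, the following Python sketch keeps a bounded buffer of recent cases per segment and opens a new segment when a case is not "similar enough" to any existing one. It is not the paper's algorithm: the class name `BufferedSegmenter`, the Euclidean distance, the centroid rule, and the `distance_threshold` and `buffer_size` parameters are all assumptions made for illustration.

```python
from collections import deque
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


class BufferedSegmenter:
    """Toy incremental segmenter that stores only a small buffer per segment."""

    def __init__(self, distance_threshold, buffer_size):
        self.distance_threshold = distance_threshold  # hypothetical "similar enough" cutoff
        self.buffer_size = buffer_size                # maximum cases retained per segment
        self.buffers = []                             # one bounded deque per segment

    @staticmethod
    def _centroid(buffer):
        dims = len(buffer[0])
        return [sum(case[d] for case in buffer) / len(buffer) for d in range(dims)]

    def assign(self, case):
        """Return the segment index for `case`, opening a new segment if needed."""
        best_index, best_distance = None, math.inf
        for index, buffer in enumerate(self.buffers):
            distance = euclidean(case, self._centroid(buffer))
            if distance < best_distance:
                best_index, best_distance = index, distance
        if best_index is None or best_distance > self.distance_threshold:
            # The case differs too much from every known segment: open a new one.
            self.buffers.append(deque([case], maxlen=self.buffer_size))
            return len(self.buffers) - 1
        # Otherwise touch only the relevant buffer; its oldest case drops out when full.
        self.buffers[best_index].append(case)
        return best_index
```

Because every centroid is computed from a buffer of at most `buffer_size` cases, the work per new case depends on the number of segments and the buffer size, not on the total number of past cases.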
Quantile regression in big data: A divide and conquer based strategy
رگرسیون کمی در داده های بزرگ: یک استراتژی مبتنی بر تقسیم و غلبه-2020
Quantile regression, which analyzes the conditional distribution of outcomes given a set of covariates, has been widely used in many fields. However, the volume and velocity of big data make the estimation of quantile regression models extremely difficult because of the intensive computation and limited storage involved. Based on a divide-and-conquer strategy, a simple and efficient method is proposed to address this problem. The proposed approach keeps only summary statistics of each data block and can then use them to reconstruct the estimator for the entire data set with asymptotically negligible approximation error. This property makes the proposed method particularly appealing when data blocks are retained on multiple servers or arrive as a data stream. Furthermore, the proposed estimator is shown to be consistent and asymptotically as efficient as the estimating equation estimator calculated using the entire data set at once when certain conditions hold. The merits of the proposed method are illustrated using both simulation studies and real data analysis.
Keywords: Data stream | Divide and conquer | Estimating equation | Massive data sets | Quantile regression
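The general divide-and-conquer pattern in the abstract above (summarize each block, discard the raw data, then combine the summaries) can be sketched in Python for the simplest case of a single quantile with no covariates. This is a deliberately naive stand-in, not the paper's estimator: it averages per-block sample quantiles weighted by block size, whereas the paper aggregates estimating-equation summaries. The function names `summarize_block` and `combine_summaries` are hypothetical.

```python
import math


def summarize_block(block, tau):
    """Reduce one data block to (size, sample tau-quantile); the block can then be discarded."""
    ordered = sorted(block)
    # Smallest order statistic whose rank covers at least a tau share of the block.
    k = max(0, math.ceil(tau * len(ordered)) - 1)
    return len(ordered), ordered[k]


def combine_summaries(summaries):
    """Size-weighted average of per-block quantiles (naive aggregation, an assumption)."""
    total = sum(n for n, _ in summaries)
    return sum(n * q for n, q in summaries) / total


# Blocks may sit on different servers or arrive one at a time as a stream.
blocks = [[1, 2, 3, 4], [5, 6, 7, 8]]
summaries = [summarize_block(block, tau=0.5) for block in blocks]
estimate = combine_summaries(summaries)
```

Only one pair of numbers per block is stored, which is the storage property the abstract emphasizes. The quality of this naive average depends on the blocks being similar samples of the same population.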
Bias reduction in the population size estimation of large data sets
کاهش تمایل در برآورد اندازه جمعیت مجموعه داده های بزرگ-2020
Estimation of the population size of large data sets and hard-to-reach populations can be a significant problem. For example, in the military, manpower is limited and the manual processing of large data sets can be time consuming. In addition, accessing the full population of data may be restricted by factors such as cost, time, and safety. Four new population size estimators are proposed, as extensions of existing methods, and their performances are compared in terms of bias with two existing methods in the big data literature. These would be particularly beneficial in the context of time-critical decisions or actions. The comparison is based on a simulation study and the application to five real network data sets (Twitter, LiveJournal, Pokec, Youtube, Wikipedia Talk). Whilst no single estimator (out of the four proposed) generates the most accurate estimates overall, the proposed estimators are shown to produce more accurate population size estimates for small sample sizes, but in some cases show more variability than existing estimators in the literature.
Keywords: Relative bias | Twitter | Size estimator | Youtube | Random walk sampling
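To make the notion of bias reduction in population size estimation concrete, here is a small Python sketch of a classical textbook pair: the Lincoln-Petersen capture-recapture estimator and Chapman's bias-corrected variant. These are not the four estimators proposed in the paper above, which build on random walk sampling of networks; they are used here only as a familiar example of correcting an estimator's small-sample bias. The sample counts in the usage lines are invented for illustration.

```python
def lincoln_petersen(first_sample, second_sample, recaptured):
    """Classical estimate N = n1 * n2 / m; undefined when nothing is recaptured."""
    if recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return first_sample * second_sample / recaptured


def chapman(first_sample, second_sample, recaptured):
    """Chapman's variant, which is less biased for small samples and defined when m = 0."""
    return (first_sample + 1) * (second_sample + 1) / (recaptured + 1) - 1


# Hypothetical counts: 50 items in the first sample, 40 in the second, 10 seen in both.
naive_estimate = lincoln_petersen(50, 40, 10)
corrected_estimate = chapman(50, 40, 10)
```

The correction matters most when the number of recaptured items is small, which is the small-sample regime the abstract highlights.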
Forecasting third-party mobile payments with implications for customer flow prediction
پیش بینی پرداخت های تلفن همراه شخص ثالث با پیامدهای پیش بینی جریان مشتری-2020
Forecasting customer flow is key for retailers in making daily operational decisions, but small retailers often lack the resources to obtain such forecasts. Rather than forecasting stores’ total customer flows, this research utilizes emerging third-party mobile payment data to provide participating stores with a value-added service by forecasting their share of daily customer flows. These customer transactions using mobile payments can then be utilized further to derive retailers’ total customer flows indirectly, thereby overcoming the constraints that small retailers face. We propose a daily mobile payments forecasting solution, centered on the third-party mobile payment platform, that is based on an extension of the newly developed Gradient Boosting Regression Tree (GBRT) method and can generate multi-step forecasts for many stores concurrently. Using empirical forecasting experiments with thousands of time series, we show that GBRT, together with a strategy for multi-period-ahead forecasting, provides more accurate forecasts than established benchmarks. Pooling data from the platform across stores leads to benefits relative to analyzing the data individually, thus demonstrating the value of this machine learning application.
Keywords: Analytics | Big data | Customer flow forecasting | Machine learning | Forecasting many time series | Multi-step-ahead forecasting strategy
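The abstract above does not say which multi-period-ahead strategy is used, so the following Python sketch simply assumes the common "direct" strategy, in which a separate model is trained for each horizon on lagged windows of the series. The helper name `make_direct_training_set` and the parameters `n_lags` and `horizon` are hypothetical; the resulting pairs could be passed to any regression learner, including a gradient boosting model.

```python
def make_direct_training_set(series, n_lags, horizon):
    """Build (features, target) pairs for forecasting `horizon` steps ahead.

    Each feature row holds the `n_lags` most recent observations; the target is
    the value observed `horizon` steps after the last lag.
    """
    features, targets = [], []
    for end in range(n_lags, len(series) - horizon + 1):
        features.append(list(series[end - n_lags:end]))
        targets.append(series[end + horizon - 1])
    return features, targets


# Invented daily payment counts for one store.
daily_payments = [1, 2, 3, 4, 5, 6]
X_two_ahead, y_two_ahead = make_direct_training_set(daily_payments, n_lags=2, horizon=2)
```

Pooling across stores, as the abstract describes, would amount to concatenating the rows produced for each store's series before fitting a single model per horizon.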
City limits in the age of smartphones and urban scaling
محدودیت های شهر در عصر تلفن های هوشمند و مقیاس بندی شهری-2020
Urban planning still lacks appropriate standards to define city boundaries across urban systems. This issue has historically been left to administrative criteria, which can vary significantly across countries and political systems, hindering comparative analysis across urban systems. However, the wide use of Information and Communication Technologies (ICT) has now allowed the development of new quantitative approaches to unveil how social dynamics relate to urban infrastructure. In fact, ICT have the potential to provide more accurate descriptions of urban systems based on the empirical analysis of millions of traces left by urbanites across the city. In this work, we apply computational techniques to a large volume of mobile phone records to define urban boundaries, through the analysis of the travel patterns and trajectories of urban dwellers in conurbations with more than 100,000 inhabitants in Chile. We created and analyzed the network of interconnected places inferred from individual travel trajectories. We then ranked each place using a spectral centrality method. This allowed us to identify places of higher concurrency and functional importance for each urban environment. Urban scaling analysis is finally used as a diagnostic tool that allowed us to distinguish urban from non-urban spaces. The geographic assessment of our method shows a high congruence with the current administrative definitions of urban agglomerations in Chile. Our results can potentially be considered as a functional definition of the urban boundary. They also provide a practical implementation of urban scaling and data-driven approaches to cities as complex systems using increasingly large non-conventional datasets.
Keywords: City boundaries definition | Spectral network analysis | Urban informatics | Social computing | Scaling laws | Complex systems | Big data
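The abstract above ranks places with "a spectral centrality method" without naming it. As one plausible reading, the Python sketch below computes eigenvector centrality by power iteration on a small undirected network of places. The choice of eigenvector centrality, the function name, the identity shift used to avoid oscillation, and the place names are all assumptions for illustration.

```python
def eigenvector_centrality(adjacency, iterations=200):
    """Power iteration on A + I for an undirected graph given as {node: [neighbors]}.

    Adding the identity leaves the eigenvectors of A unchanged but prevents the
    iteration from oscillating on bipartite graphs.
    """
    scores = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        updated = {
            node: scores[node] + sum(scores[neighbor] for neighbor in neighbors)
            for node, neighbors in adjacency.items()
        }
        norm = sum(value * value for value in updated.values()) ** 0.5
        scores = {node: value / norm for node, value in updated.items()}
    return scores


# Hypothetical places linked when individual trajectories connect them.
places = {
    "center": ["market", "station", "suburb"],
    "market": ["center", "station"],
    "station": ["center", "market"],
    "suburb": ["center"],
}
ranking = eigenvector_centrality(places)
```

In this toy network the best-connected place receives the highest score and the place with a single link the lowest, which is the kind of ordering a centrality ranking is meant to expose.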
Mobile phone network data reveal nationwide economic value of coastal tourism under climate change
ارزش اقتصادی داده های شبکه تلفن همراه در سراسر جهان از گردشگری ساحلی در اثر تغییر آب و هوا-2020
The technology-driven application of big data is expected to assist policymaking towards sustainable development; however, the relevant literature has not addressed human welfare under climate change, which limits the understanding of climate change impacts on human societies. We present the first application of unique mobile phone network data to evaluate the current nation-wide human welfare of coastal tourism at Japanese beaches and project the value change using the four climate change scenarios. The results show that the projected national economic value loss rates are more significant than the projected national physical beach loss rates. Our findings demonstrate regional differences in recreational values: most southern beaches with larger current values would disappear, while the current small values of the northern beaches would remain. These changes imply that the ranks of the beaches, based on economic values, would enable policymakers to discuss management priorities under climate change.
Keywords: Adaptation | Beach recreation | Big data | Climate change | Coastal tourism | Ecosystem services | Travel cost method | Sea level rise
Special interest tourism is not so special after all: Big data evidence from the 2017 Great American Solar Eclipse
جهانگردی با علاقه ویژه از همه مهم تر نیست: شواهد داده های بزرگ از خورشید گرفتگی بزرگ آمریکایی 2017-2020
This study puts to empirical test a major typology in the tourism literature, mass versus special interest tourism (SIT), as the once-distinctive boundary between the two has become blurry in modern tourism scholarship. We utilize 41,747 geo-located Instagram photos pertaining to the 2017 Great American Solar Eclipse and Big Data analytics to distinguish tourists based on their choice of observational destinations and spatial movement patterns. Two types of tourists are identified: opportunists and hardcore. The motivational profile of those tourists is validated with external data through hypothesis testing and compared and contrasted with existing motivation-based tourist typologies. The main conclusion is that a large share of tourists involved in what is traditionally understood as SIT activities exhibit the behavior and profile characteristic of mass tourists, seeking novelty but conscious of risks and comforts. Practical implications regarding the potential of rural and urban destinations for developing SIT are also discussed.
Keywords: Big data | Instagram photos | Social media | Spatial analysis | Special interest tourism | Astro-tourism
Challenges and recommended technologies for the industrial internet of things: A comprehensive review
چالش ها و فن آوری های پیشنهادی برای اینترنت اشیا صنعتی: مرور جامع-2020
Integrating the physical world with the cyber world opens the opportunity to create smart environments; this new paradigm is called the Internet of Things (IoT). Communication between humans and objects has been extended to communication between objects. Industrial IoT (IIoT) applies IoT communications to business applications, focusing on interoperability between machines (i.e., IIoT is a subset of the IoT). The number of everyday things and objects connected to the Internet has been increasing, which makes the IoT a dynamic network of networks. Challenges such as the heterogeneity, dynamicity, velocity, and volume of data cause IoT services to produce inconsistent, inaccurate, incomplete, and incorrect results, which is critical for many applications, especially in IIoT (e.g., health care, smart transportation, wearables, finance, and industry). Discovering, searching, and sharing data and resources reveal 40% of IoT benefits and cover almost all industrial applications. Enabling real-time data analysis, knowledge extraction, and search techniques based on Information Communication Technologies (ICT), such as data fusion, machine learning, big data, cloud computing, and blockchain, can reduce and control IoT and leverage its value. This research presents a comprehensive review of state-of-the-art challenges and recommended technologies for enabling data analysis and search in the future IoT, and presents a framework for ICT integration in IoT layers. This paper surveys current IoT search engines (IoTSEs) and presents two case studies that reflect promising enhancements in the intelligence and smartness of IoT applications due to ICT integration.
Keywords: Industrial IoT (IIoT) | Searching and indexing | Blockchain | Big data | Data fusion | Machine learning | Cloud and fog computing
“Familiar strangers” in the big data era: An exploratory study of Beijing metro encounters
"غریبه های آشنا" در عصر داده های بزرگ: یک مطالعه اکتشافی از برخورد مترو پکن-2020
Traditionally, familiar strangers are defined as those we encounter and observe repeatedly in the city but never interact with. They are common to most urban dwellers. They also have various socioeconomic, sociopsychological and public-policy implications, which have only been sporadically mentioned and/or examined in existing studies across different disciplines. In this manuscript, we first summarize the fragmentary existing studies on familiar strangers that are defined in the traditional manner based on “small data” such as survey responses. Then we reconceptualize “familiar strangers” against the backdrop of the emergence and increased availability of big and open data. Such familiar strangers are called “familiar strangers in the big data era” (FSiBDE). We then (a) synthesize and hypothesize factors influencing the distribution and quantity of the FSiBDE and (b) conduct an empirical study in the context of Beijing to embody and operationalize a special type of the FSiBDE among metro riders and to study its possible influencers. We find that, across metro stations, spatial structure, population distribution, and the transport network significantly influence the count and odds of FSiBDE among millions of metro riders. In addition, the FSiBDE can have important policy and planning implications for operating metro services and managing metro stations.
Keywords: Familiar stranger | Big data era | Implications | Odds | Distribution | Beijing
The varying patterns of rail transit ridership and their relationships with fine-scale built environment factors: Big data analytics from Guangzhou
الگوهای مختلف تفریحی حمل و نقل ریلی و روابط آنها با عوامل محیطی ساخته شده در مقیاس خوب: تجزیه و تحلیل داده های بزرگ از گوانگژو-2020
Investigating the varying patterns of rail transit ridership and their influencing factors at the station level is essential for station planning, urban planning, and passenger flow management. Although many studies have investigated the associations between rail transit ridership and the built environment, few have combined spatial big data to characterize built environment factors at a fine scale and linked those factors with the varying patterns of rail transit ridership. In this study, we characterized the fine-scale built environment factors in the central urban area of Guangzhou, China, by integrating multi-source geospatial big data including Tencent user data, building footprint and stories, points of interest (POI) data and Google Earth high-resolution images. Six direct ridership models (DRMs) based on the backward stepwise regression method were built to compare the different effects on daily, temporal and directional ridership. The results indicated that the number of station entrances/exits and the transfer dummy were positively associated with rail transit ridership, while connecting bus station sites and parking lots were not significantly related to ridership. Population density and common residences land were found to be the dominant factors in promoting morning boarding and evening alighting ridership, which implies that these two factors should be the focus when encouraging commuting-purpose rail transit usage. However, the indistinct effect of urban villages on rail transit ridership suggests that planners should pay more attention to urban regeneration in the pedestrian catchment areas (PCAs) with urban villages. High employment density and a large FAR are suggested for employment-oriented areas owing to their importance in promoting rail transit ridership, especially morning alighting and evening boarding ridership. Moreover, educational research land use significantly affected weekday ridership while sports land use positively influenced weekend ridership, which suggests that planners should pay more attention to non-commuting trips. The different influencing mechanisms of the various types of rail transit ridership highlight the need to consider land use balance planning and trip demand optimization in highly urbanized metropolises in developing countries.
Keywords: Rail transit ridership | Big data | Fine-scale | Built environment | Guangzhou