An experimental survey on big data frameworks (2018)
Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on Big Data, a buzzword referring to the processing of massive volumes of (unstructured) data. Recently proposed frameworks for Big Data applications help to store, analyze and process the data. In this paper, we discuss the challenges of Big Data and we survey existing Big Data frameworks. We also present an experimental evaluation and a comparative study of the most popular Big Data frameworks with several representative batch and iterative workloads. This survey concludes with a presentation of best practices related to the use of the studied frameworks in several application domains such as machine learning, graph processing and real-world applications.
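As a point of reference for the batch workloads used in such comparisons, the MapReduce model that these frameworks implement can be sketched in plain Python; this is an illustrative single-machine emulation of the map/shuffle/reduce phases, not any framework's actual API:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (word, 1) pairs for each input record."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data frameworks", "big data survey"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # 2
```

Real frameworks distribute the map and reduce tasks across nodes and perform the shuffle over the network; the three-phase structure is the same.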
Keywords: Big data, MapReduce, Hadoop, HDFS, Spark, Flink, Storm, Samza, Batch/stream processing
A Comparison of Big Remote Sensing Data Processing with Hadoop MapReduce and Spark (2018)
The continuous generation of huge amounts of remote sensing (RS) data is becoming a challenging task for researchers due to the 4 Vs characterizing this type of data (volume, variety, velocity and veracity). Many platforms have been proposed to deal with big data in the RS field. This paper focuses on the comparison of two well-known platforms for big RS data, namely Hadoop and Spark. We start by describing the two platforms. The first, Hadoop, is designed for processing enormous unstructured data in a distributed computing environment. It is composed of two basic elements: 1) the Hadoop Distributed File System for storage, and 2) MapReduce and YARN for parallel processing, scheduling the jobs and analyzing big RS data. The second platform, Spark, is composed of a set of libraries and uses the resilient distributed dataset to overcome the computational complexity. The last part of this paper is devoted to a comparison between the two platforms.
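The execution-model difference between the two platforms can be illustrated with a toy Python class; `MiniRDD` is a hypothetical name for a single-machine sketch of Spark's lazy, in-memory pipeline, not Spark's real API:

```python
class MiniRDD:
    """Toy stand-in for Spark's resilient distributed dataset: transformations
    are only recorded; nothing runs until an action is called."""
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []

    def map(self, fn):
        return MiniRDD(self.data, self.ops + [("map", fn)])

    def filter(self, fn):
        return MiniRDD(self.data, self.ops + [("filter", fn)])

    def collect(self):
        # Action: apply the whole lineage in memory in one pass.
        # Hadoop MapReduce would instead write each stage's output to HDFS.
        out = iter(self.data)
        for kind, fn in self.ops:
            out = map(fn, out) if kind == "map" else filter(fn, out)
        return list(out)

pixels = MiniRDD([10, 200, 35, 250])
bright = pixels.map(lambda p: p + 5).filter(lambda p: p > 100).collect()
print(bright)  # [205, 255]
```

Keeping the lineage in memory is what lets Spark avoid the per-stage disk writes that dominate iterative RS workloads on plain MapReduce.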
Index Terms: Big Data, Architectures, Hadoop, Spark, Remote Sensing Image
Enhancing water system models by integrating big data (2018)
The past quarter century has witnessed the development of advanced modeling approaches, such as stochastic and agent-based modeling, to sustainably manage water systems in the presence of deep uncertainty and complexity. However, all too often the data inputs for these powerful models are sparse and outdated, yielding unreliable results. Advancements in sensor and communication technologies have allowed for the ubiquitous deployment of sensors in water resources systems and beyond, providing high-frequency data. Processing the large amount of heterogeneous data collected is non-trivial and exceeds the capacity of traditional data warehousing and processing approaches. In the past decade, significant advances have been made in the storage, distribution, querying, and analysis of big data. Many tools have been developed by computer and data scientists to facilitate the manipulation of large datasets and create pipelines to transmit the data from data warehouses to computational analytic tools. A generic framework is presented to complete the data cycle for a water system. The data cycle presents an approach for integrating high-frequency data into existing water-related models and analyses, while highlighting some of the more helpful data management tools. These tools help make sustainable decisions that satisfy the objectives of a society. The distributed data analytics tool Spark is introduced through the illustrative application of coupling high-frequency demand metering data with a water distribution model. By updating the model in near real time, the analysis is more accurate and can expose serious misinterpretations.
Keywords: Water systems, Modeling, Big data, Automation, Hadoop, Apache Spark, Cloud computing
A distributed intrusion detection system for cloud environments based on data mining techniques (2018)
Almost two decades after its emergence, cloud computing continues to gain traction among organizations and individual users. Many security issues accompany the migration to this computing paradigm, including intrusion detection. Attack and intrusion tools have become more sophisticated, defeating traditional intrusion detection systems (IDS) with huge volumes of network traffic and dynamic behaviors. Existing cloud IDSs suffer from low detection accuracy, high false positive rates and long run times. In this paper, we present a distributed machine learning based intrusion detection system for cloud environments. The proposed system is designed to be deployed on the cloud side, alongside the edge network components of the cloud provider. This allows it to intercept incoming network traffic at the edge network routers of the physical layer. A time-based sliding window algorithm is used to preprocess the captured network traffic on each cloud router, which is then fed to an anomaly detection instance based on a Naive Bayes classifier. A set of commodity server nodes based on Hadoop and MapReduce is available to each anomaly detection instance when network congestion increases. For each time window, the anomalous network traffic data from each router side is synchronized to a central storage server. Then, an ensemble learning classifier based on a Random Forest performs a final multi-class classification step in order to detect the type of each attack.
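A minimal sketch of the per-router time-window preprocessing step, assuming fixed (tumbling) windows as a simplification of the paper's sliding window; the names and records are illustrative:

```python
from collections import defaultdict

def time_windows(events, window_seconds):
    """Group (timestamp, record) pairs into fixed time windows, mimicking
    the per-router window preprocessing before classification."""
    windows = defaultdict(list)
    for ts, record in events:
        windows[int(ts // window_seconds)].append(record)
    return dict(windows)

# Captured traffic as (seconds, packet-id) pairs; ids are made up.
traffic = [(0.5, "pkt-a"), (3.2, "pkt-b"), (12.9, "pkt-c")]
print(time_windows(traffic, 10))  # {0: ['pkt-a', 'pkt-b'], 1: ['pkt-c']}
```

Each window's records would then be featurized and passed to the Naive Bayes anomaly detector; a true sliding window would additionally overlap consecutive windows.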
Keywords: Intrusion detection systems, Cloud computing, Machine learning, Hadoop, MapReduce
Spatial cumulative sum algorithm with big data analytics for climate change detection (2018)
Big data plays a vital role in the prediction of diseases that occur due to climate change. For such predictions, scalable data storage platforms and efficient change detection algorithms are required to monitor climate change. However, traditional data storage techniques and algorithms are not applicable to processing the huge amount of climate data. This paper presents a scalable data processing framework with a novel change detection algorithm. The large volume of climate data is stored on the Hadoop Distributed File System (HDFS) and a MapReduce algorithm is applied to calculate the seasonal average of climate parameters. A spatial autocorrelation based climate change detection algorithm is proposed in this paper to monitor the changes in the seasonal climate. The proposed climate change detection algorithm is compared with various existing approaches such as the pruned exact linear time method, the binary segmentation method, and the segment neighborhood method.
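The cumulative sum (CUSUM) idea underlying the proposed detector can be sketched as a basic one-sided detector in a few lines; the paper's spatial-autocorrelation extension is not shown, and the target, threshold and data values here are arbitrary illustrations:

```python
def cusum(series, target, threshold, drift=0.0):
    """One-sided CUSUM: accumulate positive deviations from the target mean
    and report the index where the running sum crosses the threshold."""
    s = 0.0
    for i, x in enumerate(series):
        s = max(0.0, s + (x - target - drift))
        if s > threshold:
            return i  # change detected at this index
    return None  # no change detected

# Seasonal temperature averages that shift upward midway through the record.
temps = [25.0, 25.2, 24.9, 25.1, 27.8, 28.1, 28.0]
print(cusum(temps, target=25.0, threshold=4.0))  # 5
```

In the MapReduce setting, the seasonal averages fed to the detector would first be computed per region by the reduce phase.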
Keywords: Hadoop Distributed File System, Big data, Climate change, Data analytics, Weather sensor data
A Big Data Scale Algorithm for Optimal Scheduling of Integrated Microgrids (2018)
The capability of switching into the islanded operation mode of microgrids has been advocated as a viable solution to achieve high system reliability. This paper proposes a new model for the optimal scheduling and load curtailment problem of microgrids. The proposed problem determines the optimal schedule for local generators of microgrids to minimize the generation cost of the associated distribution system in normal operation. Moreover, when microgrids have to switch into the islanded operation mode due to reliability considerations, the optimal generation solution still guarantees the minimal amount of load curtailment. Due to the large number of constraints in both normal and islanded operations, the formulated problem becomes a large-scale optimization problem and is very challenging to solve using a centralized computational method. Therefore, we propose a decomposition algorithm using the alternating direction method of multipliers that provides a parallel computational framework. The simulation results demonstrate the efficiency of our proposed model in reducing generation cost, as well as guaranteeing the reliable operation of microgrids in the islanded mode. We finally describe the detailed implementation of parallel computation for our proposed algorithm to run on a computer cluster using the Hadoop MapReduce software framework.
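A minimal sketch of consensus ADMM on a toy quadratic problem shows why the method parallelizes well: the per-agent updates are independent (map tasks) and only the averaging step is global (a reduce task). The per-agent objective here stands in for the much larger microgrid subproblems, and `rho` and the iteration count are assumed values:

```python
def admm_consensus(local_targets, rho=1.0, iters=50):
    """Consensus ADMM: each agent i minimizes (x_i - a_i)^2 subject to
    x_i = z. The optimum is the mean of the local targets."""
    n = len(local_targets)
    x = [0.0] * n   # local primal variables
    u = [0.0] * n   # scaled dual variables
    z = 0.0         # consensus variable
    for _ in range(iters):
        # Local updates -- independent, one per agent (parallelizable).
        x = [(2 * a + rho * (z - ui)) / (2 + rho)
             for a, ui in zip(local_targets, u)]
        # Global averaging step -- the only coordination point.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual updates.
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z

print(round(admm_consensus([10.0, 20.0, 60.0]), 3))  # 30.0
```

Mapping the local updates to MapReduce tasks, as the paper describes, follows directly from this structure.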
Index Terms: Alternating direction method of multipliers (ADMM), big data, Hadoop, integrated microgrid, islanded operation, load curtailment, MapReduce
Apriori Versions Based on MapReduce for Mining Frequent Patterns on Big Data (2018)
Pattern mining is one of the most important tasks to extract meaningful and useful information from raw data. This task aims to extract itemsets that represent any type of homogeneity and regularity in data. Although many efficient algorithms have been developed in this regard, the growing interest in data has caused the performance of existing pattern mining techniques to drop. The goal of this paper is to propose new efficient pattern mining algorithms to work on big data. To this aim, a series of algorithms based on the MapReduce framework and the Hadoop open-source implementation have been proposed. The proposed algorithms can be divided into three main groups. First, two algorithms [Apriori MapReduce (AprioriMR) and iterative AprioriMR] with no pruning strategy are proposed, which extract any existing itemset in data. Second, two algorithms (space pruning AprioriMR and top AprioriMR) that prune the search space by means of the well-known anti-monotone property are proposed. Finally, a last algorithm (maximal AprioriMR) is also proposed for mining condensed representations of frequent patterns. To test the performance of the proposed algorithms, a varied collection of big data datasets has been considered, comprising up to 3×10^18 transactions and more than 5 million distinct single items. The experimental stage includes comparisons against highly efficient and well-known pattern mining algorithms. Results reveal the interest of applying MapReduce versions when complex problems are considered, and also the unsuitability of this paradigm when dealing with small data.
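For reference, the level-wise Apriori scheme these MapReduce versions parallelize can be sketched on a single machine; the distributed counting step (each node counting candidates over its data split) is omitted, and the transactions are made up:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: count k-itemsets, keep the frequent ones, and use
    the anti-monotone property to restrict candidates at the next level."""
    frequent = {}
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if set(c) <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # A k-itemset can only be frequent if every (k-1)-subset was frequent.
        kept_items = sorted({i for c in level for i in c})
        candidates = [c for c in combinations(kept_items, k)
                      if all(s in level for s in combinations(c, k - 1))]
    return frequent

txns = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
print(apriori(txns, min_support=2))
```

In the AprioriMR variants, the support-counting line becomes a map phase over data splits followed by a reduce phase that sums partial counts per candidate.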
Index Terms: Big data, Hadoop, MapReduce, pattern mining
Selective I/O Bypass and Load Balancing Method for Write-Through SSD Caching in Big Data Analytics (2018)
Fast network quality analysis in the telecom industry is an important method used to provide quality service. SK Telecom, based in South Korea, built a Hadoop-based analytical system consisting of a hundred nodes, each of which only contains hard disk drives (HDDs). Because the analysis process is a set of parallel I/O intensive jobs, adding solid state drives (SSDs) with appropriate settings is the most cost-efficient way to improve the performance, as shown in previous studies. Therefore, we decided to configure SSDs as a write-through cache instead of increasing the number of HDDs. To improve the cost-per-performance of the SSD cache, we introduced a selective I/O bypass (SIB) method, redirecting an automatically calculated number of read I/O requests from the SSD cache to idle HDDs when the SSDs are I/O over-saturated, which means the disk utilization is greater than 100 percent. To precisely calculate the disk utilization, we also introduced a combinational approach for SSDs, because the current method used for HDDs cannot be applied to SSDs due to their internal parallelism. In our experiments, the proposed approach achieved up to 2× faster performance than other approaches.
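The bypass decision can be sketched as a simple routing function; `bypass_fraction` is a fixed illustrative parameter here, whereas the paper computes the number of redirected requests automatically from the measured SSD utilization:

```python
def route_reads(read_requests, ssd_utilization, bypass_fraction=0.5):
    """Selective I/O bypass sketch: when the SSD cache is over-saturated
    (utilization > 100%), redirect a share of reads to idle HDDs."""
    if ssd_utilization <= 1.0:
        return read_requests, []  # all reads served from the SSD cache
    n_bypass = int(len(read_requests) * bypass_fraction)
    return read_requests[n_bypass:], read_requests[:n_bypass]

ssd, hdd = route_reads(["r1", "r2", "r3", "r4"], ssd_utilization=1.3)
print(ssd, hdd)  # ['r3', 'r4'] ['r1', 'r2']
```

This is safe for a write-through cache because the HDDs always hold a valid copy of the data, so bypassed reads never see stale blocks.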
Index Terms: I/O load balancing, SQL-on-Hadoop, SSD cache, storage hierarchies
Clustering big IoT data by metaheuristic optimized mini-batch and parallel partition-based DGC in Hadoop (2018)
Clustering algorithms are an important branch of the data mining family that has been applied widely in IoT applications such as finding similar sensing patterns, detecting outliers, and segmenting large behavioral groups in real time. Traditional full-batch k-means for clustering IoT big data is confronted by large-scale storage and high computational complexity problems. In order to overcome the latency inherited from full-batch k-means, two big data processing methods are often used. The first is to feed small batches of input data to multiple computers to reduce the computational effort. However, depending on the sensed data, which may be heterogeneously fused from different sources in an IoT network, the size of each mini batch may vary in each iteration of the clustering process. When these input data are subject to clustering, their centers would shift drastically, which affects the final clustering results. The second method is parallel computing; it decreases the runtime while the overall computational effort remains the same. Furthermore, some centroid-based clustering algorithms such as k-means converge easily into local optima. In light of this, in this paper, a new partitioned clustering method that is optimized by a metaheuristic is proposed for the IoT big data environment. The method has three main activities. First, a sample of the dataset is partitioned into mini batches. This is followed by adjusting the centroids of the mini batches of data. The third step is collating the mini batches to form clusters, so that the quality of the clusters is maximized. How the positions of the centroids are optimally attuned across the mini batches is governed by a metaheuristic called Dynamic Group Optimization. The data are processed in parallel in Hadoop. Extensive experiments are conducted to investigate the performance. The results show that our proposed method is a promising tool for clustering fused IoT data efficiently.
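The plain mini-batch k-means step that the proposed method builds on can be sketched in one dimension; the Dynamic Group Optimization tuning of centroids is not shown, and the data and parameters are illustrative:

```python
import random

def mini_batch_kmeans(points, k, batch_size, iters, seed=1):
    """Mini-batch k-means (1-D): each iteration draws a small batch and
    nudges the nearest centroid toward each sampled point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    counts = [0] * k
    for _ in range(iters):
        for p in rng.sample(points, batch_size):
            j = min(range(k), key=lambda i: abs(centroids[i] - p))
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centroid learning rate decays
            centroids[j] += eta * (p - centroids[j])
    return sorted(centroids)

# Two well-separated 1-D groups, e.g. fused sensor readings.
data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.2, 8.8, 9.1]
centers = mini_batch_kmeans(data, k=2, batch_size=4, iters=200)
print(centers)
```

Because each iteration touches only `batch_size` points, the per-iteration cost is independent of the dataset size, which is the latency advantage the abstract refers to.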
Keywords: Metaheuristic, Partitioning, Clustering, Hadoop, IoT data, Data fusion
In-Mapper combiner based MapReduce algorithm for processing of big climate data (2018)
Big data refers to a collection of massive volumes of data that cannot be processed by conventional data processing tools and technologies. In recent years, the data production sources have enlarged noticeably, such as high-end streaming devices, wireless sensor networks, satellites, and wearable Internet of Things (IoT) devices. These sources generate a massive volume of data in a continuous manner. The large volume of climate data is collected from the IoT weather sensor devices and NCEP. In this paper, a big data processing framework is proposed to integrate climate and health data and to find the correlation between the climate parameters and the incidence of dengue. This framework is demonstrated with the help of the MapReduce programming model, Hive, HBase and ArcGIS in a Hadoop Distributed File System (HDFS) environment. The following weather parameters are collected for the study area, Tamil Nadu, with the help of IoT weather sensor devices and NCEP: minimum temperature, maximum temperature, wind, precipitation, solar radiation and relative humidity. The proposed framework focuses only on climate data for 32 districts of Tamil Nadu, where each district contains 157,680 rows, so there are 5,045,760 rows in total. Batch view precomputation for the monthly mean of various climate parameters would require all 5,045,760 rows, which would create more latency in query processing. In order to overcome this issue, batch views can be precomputed over a smaller number of records, with more computation done at query time. The In-Mapper based MapReduce framework is used to compute the monthly mean of each climate parameter for each latitude and longitude. The experimental results prove that the response time of the In-Mapper based combiner algorithm is lower than that of the existing MapReduce algorithm.
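The contrast between a plain mapper and an in-mapper combiner can be sketched as follows; the station IDs and readings are made up for illustration:

```python
from collections import defaultdict

def plain_mapper(records):
    """Plain mapper: emit one (key, value) pair per record -- every pair
    must cross the network in the shuffle."""
    for station, temp in records:
        yield (station, temp)

def in_mapper_combiner(records):
    """In-mapper combiner: aggregate locally inside the mapper and emit one
    partial (sum, count) pair per key -- far fewer pairs to shuffle."""
    partial = defaultdict(lambda: [0.0, 0])
    for station, temp in records:
        partial[station][0] += temp
        partial[station][1] += 1
    for station, (total, count) in partial.items():
        yield (station, (total, count))

readings = [("TN01", 30.0), ("TN01", 32.0), ("TN02", 28.0), ("TN01", 31.0)]
print(len(list(plain_mapper(readings))))   # 4 pairs emitted
print(dict(in_mapper_combiner(readings)))  # one partial sum per station
```

The reducer then sums the partial sums and counts per key to obtain the monthly mean, so the final result is identical while the shuffle volume shrinks, which is where the response-time improvement comes from.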
Keywords: Big data, Internet of Things, Weather sensor devices, MapReduce programming model, Hadoop Distributed File System