Toward modeling and optimization of features selection in Big Data based social Internet of Things
Toward modeling and optimization of feature selection in Big Data based on the social Internet of Things (2018)
The growing gap between users and Big Data analytics calls for innovative tools that address the challenges posed by the volume, variety, and velocity of big data. At such volumes, analyzing the data and selecting features from it becomes computationally inefficient. Advances in Big Data applications and data science pose additional challenges: selecting appropriate features and a suitable High-Performance Computing (HPC) solution has become a key issue and has attracted attention in recent years. There is therefore a need for a system that can efficiently select features and analyze a stream of Big Data within these requirements. This paper presents a system architecture that selects features using the Artificial Bee Colony (ABC) algorithm. A Kalman filter is used within the Hadoop ecosystem to remove noise, and traditional MapReduce is combined with ABC to improve processing efficiency. A complete four-tier architecture is also proposed that efficiently aggregates the data, eliminates unnecessary data, and analyzes the data with the proposed Hadoop-based ABC algorithm. To evaluate the proposed algorithms, we implemented the system using Hadoop and MapReduce with the ABC algorithm: ABC selects the features, while MapReduce is supported by a parallel algorithm that efficiently processes huge data sets. The system is implemented with MapReduce on top of parallel Hadoop nodes and runs in near real time. The proposed system is compared with swarm approaches and evaluated for efficiency, accuracy, and throughput on ten different data sets. The results show that the proposed system is more scalable and efficient in selecting features.
Keywords: SIoT, Big Data, ABC algorithm, Feature selection
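The abstract names a Kalman filter for noise removal but gives no details of its design. The following is only a minimal sketch of the general technique, assuming a single scalar sensor reading that drifts slowly (a random-walk state model). The function name and both variance defaults are hypothetical and do not come from the paper.

```python
def kalman_denoise(readings, process_var=1e-4, measurement_var=0.25):
    """Smooth a 1-D stream of noisy readings with a scalar Kalman filter.

    process_var and measurement_var are illustrative defaults (assumptions),
    not values taken from the paper.
    """
    if not readings:
        return []
    estimate = readings[0]   # initial state guess: the first reading
    error_var = 1.0          # initial uncertainty of that guess
    smoothed = []
    for z in readings:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed
```

A larger measurement_var makes the filter trust each new reading less and smooth more aggressively; a larger process_var lets the estimate follow real changes more quickly.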
Big data requirements in current and next fusion research experiments
Big data requirements in current and next fusion experiments (2018)
The present and future data management requirements for fusion experiments are presented along with the currently adopted solutions. Even if the presented solutions fulfil the requirements of the current experiments, the next generation of fusion devices is likely to produce and require an unprecedented amount of data. For this reason, the solutions adopted today, and those foreseen for the experiments under construction, might not prove scalable enough. Information Technology already provides efficient solutions for big data management, successfully employed for large cloud applications and social media. In particular, MongoDB, Cassandra and Hadoop are promising candidates for the next generation of experiments because their combined usage covers the specific data requirements of fusion research.
Keywords: Big Data ; Nuclear Fusion Experiment ; Data Acquisition ; Databases
BDEv 3.0: Energy efficiency and microarchitectural characterization of Big Data processing frameworks
BDEv 3.0: Energy efficiency and microarchitectural characteristics of Big Data processing frameworks (2018)
As the size of Big Data workloads keeps increasing, the evaluation of distributed frameworks becomes a crucial task in order to identify potential performance bottlenecks that may delay the processing of large datasets. While most of the existing works generally focus only on execution time and resource utilization, analyzing other important metrics is key to fully understanding the behavior of these frameworks. For example, microarchitecture-level events can bring meaningful insights to characterize the interaction between frameworks and hardware. Moreover, energy consumption is also gaining increasing attention as systems scale to thousands of cores. This work discusses the current state of the art in evaluating distributed processing frameworks, while extending our Big Data Evaluator tool (BDEv) to extract energy efficiency and microarchitecture-level metrics from the execution of representative Big Data workloads. An experimental evaluation using BDEv demonstrates its usefulness to bring meaningful information from popular frameworks such as Hadoop, Spark and Flink.
Keywords: Big Data processing, performance evaluation, energy efficiency, microarchitectural characterization
An experimental survey on big data frameworks
An experimental survey of big data frameworks (2018)
Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on Big Data, a buzzword referring to the processing of massive volumes of (unstructured) data. Recently proposed frameworks for Big Data applications help to store, analyze and process the data. In this paper, we discuss the challenges of Big Data and we survey existing Big Data frameworks. We also present an experimental evaluation and a comparative study of the most popular Big Data frameworks with several representative batch and iterative workloads. This survey is concluded with a presentation of best practices related to the use of studied frameworks in several application domains such as machine learning, graph processing and real-world applications.
Keywords: Big data, MapReduce, Hadoop, HDFS, Spark, Flink, Storm, Samza, Batch/stream processing
A Comparison of Big Remote Sensing Data Processing with Hadoop MapReduce and Spark
A comparison of remote sensing data processing with Hadoop MapReduce and Spark (2018)
The continuous generation of huge amounts of remote sensing (RS) data is becoming a challenge for researchers due to the 4 Vs characterizing this type of data (volume, variety, velocity and veracity). Many platforms have been proposed to deal with big data in the RS field. This paper focuses on the comparison of two well-known platforms for big RS data, namely Hadoop and Spark. We start by describing the two platforms. The first, Hadoop, is designed for processing enormous unstructured data in a distributed computing environment. It is composed of two basic elements: 1) the Hadoop Distributed File System (HDFS) for storage, and 2) MapReduce and YARN for parallel processing, job scheduling and analysis of big RS data. The second platform, Spark, is composed of a set of libraries and uses the resilient distributed dataset (RDD) to overcome the computational complexity. The last part of this paper is devoted to a comparison between the two platforms.
Index Terms: Big Data, Architectures, Hadoop, Spark, Remote Sensing Image
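To illustrate the map, shuffle and reduce phases that the Hadoop description relies on, here is a minimal single-process sketch in plain Python. It does not use Hadoop or Spark. The task (mean pixel value per spectral band over image tiles), the tile layout and all function names are assumptions chosen for illustration, not taken from the paper.

```python
from collections import defaultdict

def map_phase(tile):
    """Emit (band, (pixel_sum, pixel_count)) for each band in one tile."""
    for band, pixels in tile.items():
        yield band, (sum(pixels), len(pixels))

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(band, partials):
    """Combine partial sums and counts into one mean per band."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return band, total / count

def band_means(tiles):
    """Run map, shuffle and reduce over a list of tiles (dicts of band -> pixels)."""
    pairs = (pair for tile in tiles for pair in map_phase(tile))
    return dict(reduce_phase(b, p) for b, p in shuffle(pairs).items())
```

Emitting partial sums and counts rather than raw pixels keeps the intermediate data small, which is the same reason a combiner is often used in MapReduce jobs.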
Efficient jobs scheduling approach for big data applications
Efficient job scheduling approach for big data (2018)
The MapReduce framework has become a leading scheme for processing large-scale data applications in recent years. However, big data applications executed on computer clusters require a large amount of energy, which accounts for a considerable fraction of the data center’s overall costs. Reducing energy consumption has therefore become a critical issue for data centers. Although Hadoop YARN adopts fine-grained resource management schemes for job scheduling, it doesn’t consider the energy-saving problem. In this paper, an Energy-aware Fair Scheduling framework based on YARN (denoted EFS) is proposed, which can effectively reduce energy consumption while meeting the required Service Level Agreements (SLAs). EFS can not only schedule jobs to energy-efficient nodes, but also power nodes on or off. To do so, energy-aware dynamic capacity management with a deadline-driven policy is used to allocate resources for MapReduce tasks according to the average execution time of containers and the resources requested by users. The energy-aware fair scheduling problem is then modeled as a multi-dimensional knapsack problem (MKP), and an energy-aware greedy algorithm (EAGA) is proposed to achieve fine-grained placement of tasks on energy-efficient nodes. Finally, nodes that have remained idle for the threshold duration are turned off to reduce energy costs. We perform extensive experiments on Hadoop YARN clusters to compare the energy consumption and execution time of EFS with some state-of-the-art policies. The experimental results show that EFS can not only keep the proper number of nodes powered on to meet the computing requirements but also achieve energy savings.
Keywords: Big data, Dynamic scheduling, Energy efficiency, MapReduce, Resource allocation
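The abstract does not spell out EAGA, so the following is not the paper's algorithm. It is a generic greedy heuristic for a two-dimensional (CPU and memory) placement problem, sketched under assumptions: each node carries a hypothetical `watts_per_cpu` efficiency score, and every name in the code is illustrative.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu: int              # free vCPUs
    mem: int              # free memory in GB
    watts_per_cpu: float  # hypothetical efficiency score, lower is better

def greedy_place(tasks, nodes):
    """Assign each (task_id, cpu, mem) to the most efficient node that fits.

    Mutates the free capacity of the given nodes.
    Returns (placement dict, list of unplaced task ids).
    """
    placement, unplaced = {}, []
    by_efficiency = sorted(nodes, key=lambda n: n.watts_per_cpu)
    # Place large tasks first so they are not crowded out by small ones.
    ordered = sorted(tasks, key=lambda t: (t[1], t[2]), reverse=True)
    for task_id, cpu, mem in ordered:
        for node in by_efficiency:
            if node.cpu >= cpu and node.mem >= mem:
                node.cpu -= cpu
                node.mem -= mem
                placement[task_id] = node.name
                break
        else:
            # No node has enough free CPU and memory for this task.
            unplaced.append(task_id)
    return placement, unplaced
```

Because efficient nodes fill up first, less efficient nodes tend to stay empty, which is what makes an idle-node power-off policy like the one described above worthwhile.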
Scalable system scheduling for HPC and big data
Scalable scheduling for HPC and big data (2018)
In the rapidly expanding field of parallel processing, job schedulers are the “operating systems” of modern big data architectures and supercomputing systems. Job schedulers allocate computing resources and control the execution of processes on those resources. Historically, job schedulers were the domain of supercomputers, and job schedulers were designed to run massive, long-running computations over days and weeks. More recently, big data workloads have created a need for a new class of computations consisting of many short computations taking seconds or minutes that process enormous quantities of data. For both supercomputers and big data systems, the efficiency of the job scheduler represents a fundamental limit on the efficiency of the system. Detailed measurement and modeling of the performance of schedulers are critical for maximizing the performance of a large-scale computing system. This paper presents a detailed feature analysis of 15 supercomputing and big data schedulers. For big data workloads, the scheduler latency is the most important performance characteristic of the scheduler. A theoretical model of the latency of these schedulers is developed and used to design experiments targeted at measuring scheduler latency. Detailed benchmarking of four of the most popular schedulers (Slurm, Son of Grid Engine, Mesos, and Hadoop YARN) is conducted. The theoretical model is compared with data and demonstrates that scheduler performance can be characterized by two key parameters: the marginal latency of the scheduler t_s and a nonlinear exponent α_s. For all four schedulers, the utilization of the computing system decreases to <10% for computations lasting only a few seconds. Multi-level schedulers (such as LLMapReduce) that transparently aggregate short computations can improve utilization for these short computations to >90% for all four of the schedulers that were tested.
Keywords: Scheduler, Resource manager, Job scheduler, High performance computing, Data analytics
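The abstract names the two model parameters (marginal latency and a nonlinear exponent) but not the formula that combines them. The sketch below assumes one plausible form, in which scheduling n tasks on a processor slot adds a total overhead of t_s * n**alpha_s. That functional form, the function names and the example numbers are assumptions for illustration, not results from the paper.

```python
def utilization(task_seconds, tasks_per_slot, t_s, alpha_s):
    """Fraction of wall-clock time spent on useful work in one processor slot.

    Assumes scheduler overhead of t_s * n**alpha_s for n tasks (assumed form).
    """
    useful = task_seconds * tasks_per_slot
    overhead = t_s * tasks_per_slot ** alpha_s
    return useful / (useful + overhead)

def aggregated_utilization(task_seconds, tasks_per_slot, t_s, alpha_s, bundle):
    """Same model after bundling `bundle` short tasks into each scheduled job."""
    jobs = tasks_per_slot / bundle
    return utilization(task_seconds * bundle, jobs, t_s, alpha_s)

if __name__ == "__main__":
    # Hypothetical numbers: 100 one-second tasks, 1 s marginal latency, alpha_s = 1.
    print(utilization(1.0, 100, 1.0, 1.0))
    print(aggregated_utilization(1.0, 100, 1.0, 1.0, bundle=100))
```

Under this assumed form, bundling helps because the scheduler overhead is paid once per bundle instead of once per short task, which is the idea behind multi-level scheduling.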
A course on big data analytics
A course on big data analytics (2018)
This report details a course on big data analytics designed for undergraduate junior and senior computer science students. The course is heavily focused on projects and writing code for big data processing. It is designed to help students learn parallel and distributed computing frameworks and techniques commonly used in industry. The curriculum includes a progression of projects requiring increasingly sophisticated big data processing, ranging from data preprocessing with Linux tools, through distributed processing with Hadoop MapReduce and Spark, to database queries with Hive and Google’s BigQuery. We discuss hardware infrastructure and experimentally evaluate the cost/benefit of an on-premise server versus Amazon’s Elastic MapReduce. Finally, we showcase outcomes of our course in terms of student engagement and anonymous student feedback.
Keywords: Curriculum, Undergraduate education, Big data, Cloud computing
Concurrence of big data analytics and healthcare: A systematic review
Concurrence of big data analytics and healthcare: A systematic review (2018)
Background: The application of Big Data analytics in healthcare has immense potential for improving the quality of care, reducing waste and error, and reducing the cost of care. Purpose: This systematic review of literature aims to determine the scope of Big Data analytics in healthcare, including its applications and the challenges in its adoption in healthcare. It also intends to identify the strategies to overcome the challenges. Data sources: A systematic search of the articles was carried out on five major scientific databases: ScienceDirect, PubMed, Emerald, IEEE Xplore and Taylor & Francis. The articles on Big Data analytics in healthcare published in English language literature from January 2013 to January 2018 were considered. Study selection: Descriptive articles and usability studies of Big Data analytics in healthcare and medicine were selected. Data extraction: Two reviewers independently extracted information on definitions of Big Data analytics; sources and applications of Big Data analytics in healthcare; challenges and strategies to overcome the challenges in healthcare. Results: A total of 58 articles were selected as per the inclusion criteria and analyzed. The analyses of these articles found that: (1) researchers lack consensus about the operational definition of Big Data in healthcare; (2) Big Data in healthcare comes from internal sources within hospitals or clinics as well as external sources including government, laboratories, pharma companies, data aggregators, medical journals etc.; (3) natural language processing (NLP) is the most widely used Big Data analytical technique for healthcare, and most of the processing tools used for analytics are based on Hadoop; (4) Big Data analytics finds its application in clinical decision support, optimization of clinical operations and reduction of the cost of care; (5) the major challenge in the adoption of Big Data analytics is the non-availability of evidence of its practical benefits in healthcare.
Conclusion: This review study reveals that there is a paucity of information on evidence of real-world use of Big Data analytics in healthcare. This is because the usability studies have considered only a qualitative approach, which describes potential benefits but does not take quantitative study into account. Also, the majority of the studies were from developed countries, which brings out the need to promote research on healthcare Big Data analytics in developing countries.
Keywords: Big data, Analytics, Healthcare, Predictive analytics, Evidence-based medicine
Enhancing water system models by integrating big data
Enhancing water system models by integrating big data (2018)
The past quarter century has witnessed development of advanced modeling approaches, such as stochastic and agent-based modeling, to sustainably manage water systems in the presence of deep uncertainty and complexity. However, all too often data inputs for these powerful models are sparse and outdated, yielding unreliable results. Advancements in sensor and communication technologies have allowed for the ubiquitous deployment of sensors in water resources systems and beyond, providing high-frequency data. Processing the large amount of heterogeneous data collected is non-trivial and exceeds the capacity of traditional data warehousing and processing approaches. In the past decade, significant advances have been made in the storage, distribution, querying, and analysis of big data. Many tools have been developed by computer and data scientists to facilitate the manipulation of large datasets and create pipelines to transmit the data from data warehouses to computational analytic tools. A generic framework is presented to complete the data cycle for a water system. The data cycle presents an approach for integrating high-frequency data into existing water-related models and analyses, while highlighting some of the more helpful data management tools. The data tools are helpful to make sustainable decisions, which satisfy the objectives of a society. Data analytics distribution tool Spark is introduced through the illustrative application of coupling high-frequency demand metering data with a water distribution model. By updating the model in near real-time, the analysis is more accurate and can expose serious misinterpretations.
Keywords: Water systems, Modeling, Big data, Automation, Hadoop, Apache Spark, Cloud computing
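One preprocessing step implied by coupling high-frequency metering data with a distribution model is rolling raw meter readings up to the time step the model expects. The sketch below does this in plain Python rather than Spark. The record layout (node id, epoch seconds, litres), the hourly time step and the function name are all assumptions for illustration.

```python
from collections import defaultdict

def hourly_node_demand(readings):
    """Sum meter volumes per (node, hour).

    `readings` is an iterable of (node_id, epoch_seconds, litres) tuples;
    this layout is an assumption, not the paper's data format.
    """
    demand = defaultdict(float)
    for node, ts, litres in readings:
        hour = ts // 3600   # integer hour bucket since the epoch
        demand[(node, hour)] += litres
    return dict(demand)
```

The same grouping maps naturally onto a distributed group-by-key aggregation when the readings no longer fit on one machine.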