Toward modeling and optimization of features selection in Big Data based social Internet of Things
(2018)
The growing gap between users and Big Data analytics requires innovative tools that address the challenges posed by big data volume, variety, and velocity. Analyzing and selecting features from such massive volumes of data is computationally inefficient. Moreover, advances in Big Data applications and data science pose additional challenges: selecting appropriate features and finding a High-Performance Computing (HPC) solution have become key issues that have attracted attention in recent years. Given these needs, a system is required that can efficiently select features from and analyze a stream of Big Data within its requirements. Hence, this paper presents a system architecture that selects features using the Artificial Bee Colony (ABC) algorithm. A Kalman filter is used in the Hadoop ecosystem to remove noise, and traditional MapReduce is combined with ABC to enhance processing efficiency. A complete four-tier architecture is also proposed that efficiently aggregates data, eliminates unnecessary data, and analyzes the data with the proposed Hadoop-based ABC algorithm. To check the efficiency of the algorithms exploited in the proposed architecture, we implemented the system using Hadoop and MapReduce with the ABC algorithm: ABC selects features, while MapReduce is supported by a parallel algorithm that efficiently processes huge data sets. The system is implemented with the MapReduce tool on top of Hadoop parallel nodes in near real time. The proposed system is compared with swarm approaches and evaluated in terms of efficiency, accuracy, and throughput on ten different data sets. The results show that the proposed system is more scalable and efficient in selecting features.
Keywords: SIoT, Big Data, ABC algorithm, Feature selection
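As a rough illustration of the selection step described above, the sketch below implements a minimal single-machine binary ABC search over a toy relevance score. The fitness function, penalty weight, and all parameter values are illustrative assumptions, not the paper's actual MapReduce-distributed objective (which evaluates subsets against real data sets across Hadoop nodes).

```python
import random

random.seed(0)

N_FEATURES = 10
# Hypothetical per-feature relevance scores; in the paper these would come
# from evaluating the selected subset via MapReduce jobs on real data.
relevance = [random.random() for _ in range(N_FEATURES)]

def fitness(subset):
    """Toy objective: total relevance minus a small penalty per feature."""
    chosen = [relevance[i] for i, bit in enumerate(subset) if bit]
    return sum(chosen) - 0.05 * len(chosen)

def neighbor(subset):
    """Flip one random bit, as employed bees do when exploring a source."""
    s = subset[:]
    i = random.randrange(N_FEATURES)
    s[i] = 1 - s[i]
    return s

def abc_select(n_bees=10, max_iters=50, limit=5):
    sources = [[random.randint(0, 1) for _ in range(N_FEATURES)]
               for _ in range(n_bees)]
    trials = [0] * n_bees
    best = max(sources, key=fitness)
    for _ in range(max_iters):
        # Employed + onlooker phases (merged for brevity): greedy local search.
        for k in range(n_bees):
            cand = neighbor(sources[k])
            if fitness(cand) > fitness(sources[k]):
                sources[k], trials[k] = cand, 0
            else:
                trials[k] += 1
        # Scout phase: abandon sources that stopped improving.
        for k in range(n_bees):
            if trials[k] > limit:
                sources[k] = [random.randint(0, 1) for _ in range(N_FEATURES)]
                trials[k] = 0
        best = max(sources + [best], key=fitness)
    return best

best = abc_select()
print(best, round(fitness(best), 3))
```

In the paper's setting, each fitness evaluation would itself be a MapReduce job, so the employed-bee loop parallelizes naturally across the colony.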
A new architecture of Internet of Things and big data ecosystem for secured smart healthcare monitoring and alerting system
(2018)
Wearable medical devices with sensors continuously generate enormous amounts of data, often called big data, mixed with structured and unstructured data. Due to the complexity of this data, it is difficult to process and analyze it to find valuable information that can be useful in decision making. On the other hand, data security is a key requirement in healthcare big data systems. To overcome these issues, this paper proposes a new architecture for the implementation of IoT to store and process scalable sensor data (big data) for healthcare applications. The proposed architecture consists of two main sub-architectures, namely Meta Fog-Redirection (MF-R) and Grouping and Choosing (GC). The MF-R architecture uses big data technologies such as Apache Pig and Apache HBase to collect and store the sensor data (big data) generated by different sensor devices. The proposed GC architecture secures the integration of fog computing with cloud computing. It also uses a key management service and a data categorization function (Sensitive, Critical, and Normal) to provide security services. The framework further uses a MapReduce-based prediction model to predict heart disease. Performance evaluation parameters such as throughput, sensitivity, accuracy, and f-measure are calculated to demonstrate the efficiency of the proposed architecture as well as the prediction model.
Keywords: Wireless sensor networks, Internet of Things, Big data analytics, Cloud computing and healthcare
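The data categorization function (Sensitive, Critical, Normal) mentioned above could look roughly like the following sketch. The field names and thresholds are hypothetical, since the paper's exact rules are not reproduced here.

```python
def categorize(record):
    """Assign a hypothetical security/urgency category to a sensor record.

    The rules below are illustrative only: identifying medical fields are
    treated as Sensitive, out-of-range vitals as Critical, the rest Normal.
    """
    if record.get("patient_id") or record.get("diagnosis"):
        return "Sensitive"   # personally identifying / medical detail
    if record.get("heart_rate", 0) > 120 or record.get("spo2", 100) < 90:
        return "Critical"    # readings that should trigger an alert
    return "Normal"

readings = [{"patient_id": "p42", "diagnosis": "arrhythmia"},
            {"heart_rate": 140, "spo2": 95},
            {"heart_rate": 72, "spo2": 98}]
print([categorize(r) for r in readings])  # ['Sensitive', 'Critical', 'Normal']
```

In the proposed architecture, the category would then determine which key-management and storage policy (fog vs. cloud) applies to each record.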
A novel adaptive e-learning model based on Big Data by using competence-based knowledge and social learner activities
(2018)
The e-learning paradigm is becoming one of the most important educational methods and a decisive factor for learning and for making learning relevant. However, most existing platforms offer a traditional e-learning system in which all learners access the same evaluation and learning content. In response, Big Data technology in the proposed adaptive e-learning model allows new approaches and new learning strategies to be considered. In this paper, we propose an adaptive e-learning model for providing the most suitable learning content for each learner. This model is based on two levels of adaptive e-learning. The first level involves two steps: (1) determining the relevant future educational objectives through an adequate learner e-assessment method using a MapReduce-based Genetic Algorithm, and (2) generating an adaptive learning path for each learner using a MapReduce-based Ant Colony Optimization algorithm. In the second level, we propose MapReduce-based Social Network Analysis for determining learner motivation and social productivity, in order to assign a specific learning rhythm to each learner. Finally, the experimental results show that the presented algorithms, implemented in a Big Data environment, converge much better than traditional concurrent implementations. A further benefit of this work is that it describes how Big Data technology transforms the e-learning paradigm.
Keywords: Adaptive e-learning, Big data, MapReduce, Genetic algorithm, Personalized learning path, Ant colony optimization algorithms, Social networks analysis, Motivation and productivity, Learning content
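The Ant Colony Optimization step for generating learning paths can be sketched in a much-simplified, single-process form as follows. The learning objects, difficulty scores, and path-quality heuristic are illustrative assumptions; the paper distributes the ants over MapReduce.

```python
import random

random.seed(1)

# Hypothetical learning objects with a difficulty score; an ant builds an
# ordering, and path quality rewards smooth difficulty transitions.
difficulty = [0.2, 0.9, 0.5, 0.7, 0.3]
n = len(difficulty)
pheromone = [[1.0] * n for _ in range(n)]

def path_quality(path):
    # Penalize large jumps in difficulty between consecutive objects.
    jumps = sum(abs(difficulty[a] - difficulty[b])
                for a, b in zip(path, path[1:]))
    return 1.0 / (1.0 + jumps)

def build_path():
    path, remaining = [0], set(range(1, n))
    while remaining:
        cur = path[-1]
        choices = list(remaining)
        weights = [pheromone[cur][j] for j in choices]
        nxt = random.choices(choices, weights=weights)[0]
        path.append(nxt)
        remaining.remove(nxt)
    return path

def aco(iters=100, ants=10, rho=0.1):
    best, best_q = None, -1.0
    for _ in range(iters):
        paths = [build_path() for _ in range(ants)]
        for i in range(n):              # evaporation
            for j in range(n):
                pheromone[i][j] *= (1 - rho)
        for p in paths:                 # reinforcement by quality
            q = path_quality(p)
            for a, b in zip(p, p[1:]):
                pheromone[a][b] += q
            if q > best_q:
                best, best_q = p, q
    return best, best_q

path, q = aco()
print(path, round(q, 3))
```

In a MapReduce deployment, each mapper would run a batch of ants and the reducer would merge pheromone updates, which is what makes the approach scale to many learners.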
Improving the Effectiveness of Burst Buffers for Big Data Processing in HPC Systems with Eley
(2018)
Burst Buffers are an effective solution for reducing data transfer time and I/O interference in HPC systems. Extending Burst Buffers (BBs) to handle Big Data applications is challenging because BBs must account for the large data inputs of Big Data applications and the Quality-of-Service (QoS) of HPC applications, which are considered first-class citizens in HPC systems. Existing BBs focus only on the intermediate data of Big Data applications and incur high performance degradation for both Big Data and HPC applications. We present Eley, a burst buffer solution that helps accelerate Big Data applications while guaranteeing the QoS of HPC applications. To achieve this goal, Eley embraces an interference-aware prefetching technique that makes reading data inputs faster while introducing low interference for HPC applications. Evaluations using a wide range of Big Data and HPC applications demonstrate that Eley improves the performance of Big Data applications by up to 30% compared to existing BBs while maintaining the QoS of HPC applications.
Keywords: HPC, MapReduce, Big data, Parallel file systems, Burst buffers, Interference, Prefetch
An experimental survey on big data frameworks
(2018)
Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable for coping with these huge amounts of generated data. Yet much research focuses on Big Data, a buzzword referring to the processing of massive volumes of (unstructured) data. Recently proposed frameworks for Big Data applications help to store, analyze, and process the data. In this paper, we discuss the challenges of Big Data and survey existing Big Data frameworks. We also present an experimental evaluation and comparative study of the most popular Big Data frameworks with several representative batch and iterative workloads. The survey concludes with a presentation of best practices related to the use of the studied frameworks in several application domains such as machine learning, graph processing, and real-world applications.
Keywords: Big data, MapReduce, Hadoop, HDFS, Spark, Flink, Storm, Samza, Batch/stream processing
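All of the surveyed frameworks build on, or generalize, the MapReduce dataflow, which can be simulated in a few lines of plain Python: map, then shuffle (group by key), then reduce. This is a single-process sketch of the model, not any framework's API.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Emit (key, value) pairs; here, (word, 1) for a word count."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group values by key, as the framework's shuffle stage does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine each key's values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big ideas", "data data everywhere"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 3, 'ideas': 1, 'everywhere': 1}
```

Batch engines such as Hadoop run these phases over HDFS blocks, while Spark and Flink keep intermediate groups in memory, which is the main performance distinction the survey's iterative workloads expose.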
An improved distributed storage and query for remote sensing data
(2018)
With the rapid development of information technology, the amount of remote sensing data is increasing at an unprecedented scale. Faced with massive remote sensing data, traditional processing methods suffer from low efficiency and a lack of scalability, so this paper uses open-source big data technology to improve them. Firstly, a storage model for remote sensing image data is designed using the distributed storage database HBase. Then, a grid index and the Hilbert curve are combined to build an index over the image data. Finally, MapReduce parallel processing is used to write and query remote sensing images. The experimental results show that the method can effectively improve data writing and query speed, and has good scalability.
Keywords: remote sensing data; distributed storage; data query; HBase; MapReduce
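The grid-plus-Hilbert-curve index described above relies on mapping a 2-D tile coordinate to its 1-D distance along a Hilbert curve, so that spatially close tiles get nearby keys in HBase's sorted key space. Below is the standard iterative algorithm, together with a hypothetical row-key layout (the paper's actual key format is not given here).

```python
def xy2d(order, x, y):
    """Distance of grid cell (x, y) along a Hilbert curve of the given
    order (grid side length 2**order). Standard iterative algorithm."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d

# Hypothetical HBase row-key layout: Hilbert distance as a fixed-width
# prefix keeps spatially adjacent tiles adjacent in the key space.
def row_key(order, tile_x, tile_y, image_id):
    return f"{xy2d(order, tile_x, tile_y):010d}_{image_id}"

print(row_key(12, 1034, 2077, "scene_001"))
```

Compared with interleaved-bit (Z-order) keys, the Hilbert mapping avoids the large jumps between some neighboring cells, which is why it pairs well with HBase range scans for spatial queries.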
A Comparison of Big Remote Sensing Data Processing with Hadoop MapReduce and Spark
(2018)
The continuous generation of huge amounts of remote sensing (RS) data is becoming a challenging task for researchers due to the 4 Vs characterizing this type of data (volume, variety, velocity, and veracity). Many platforms have been proposed to deal with big data in the RS field. This paper focuses on the comparison of two well-known big RS data platforms, namely Hadoop and Spark. We start by describing the two platforms. The first, Hadoop, is designed for processing enormous unstructured data in a distributed computing environment. It is composed of two basic elements: (1) the Hadoop Distributed File System (HDFS) for storage, and (2) MapReduce and YARN for parallel processing, job scheduling, and analyzing big RS data. The second platform, Spark, is composed of a set of libraries and uses the Resilient Distributed Dataset (RDD) to overcome computational complexity. The last part of this paper is devoted to a comparison between the two platforms.
Index Terms: Big Data, Architectures, Hadoop, Spark, Remote Sensing Image
Efficient jobs scheduling approach for big data applications
(2018)
The MapReduce framework has become a leading scheme for processing large-scale data applications in recent years. However, big data applications executed on computer clusters require a large amount of energy, which accounts for a considerable fraction of a data center's overall costs. How to reduce energy consumption has therefore become a critical issue for data centers. Although Hadoop YARN adopts fine-grained resource management schemes for job scheduling, it does not consider energy savings. In this paper, an Energy-aware Fair Scheduling framework based on YARN (EFS) is proposed, which can effectively reduce energy consumption while meeting the required Service Level Agreements (SLAs). EFS can not only schedule jobs to energy-efficient nodes but also power nodes on or off. To do so, energy-aware dynamic capacity management with a deadline-driven policy is used to allocate resources for MapReduce tasks based on the average execution time of containers and users' resource requests. The energy-aware fair scheduling problem is then modeled as a multi-dimensional knapsack problem (MKP), and an energy-aware greedy algorithm (EAGA) is proposed to realize fine-grained placement of tasks on energy-efficient nodes. Finally, nodes that have been idle for a threshold duration are turned off to reduce energy costs. We perform extensive experiments on Hadoop YARN clusters to compare the energy consumption and execution time of EFS with some state-of-the-art policies. The experimental results show that EFS not only keeps the proper number of nodes powered on to meet computing requirements but also achieves energy savings.
Keywords: Big data, Dynamic scheduling, Energy efficiency, MapReduce, Resource allocation
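The EAGA placement step can be sketched as a greedy first-fit over nodes ordered by an energy-efficiency metric. The task and node attributes, and the watts-per-core metric, are illustrative assumptions here, not the paper's exact formulation of the multi-dimensional knapsack.

```python
def eaga(tasks, nodes):
    """Greedy sketch: place each task on the most energy-efficient node
    (lowest watts per core, an illustrative metric) that still has capacity.
    Returns a task-id -> node-id mapping; None means no powered-on node fits
    (which, in EFS, would trigger powering on an additional node)."""
    nodes = sorted(nodes, key=lambda nd: nd["watts_per_core"])
    placement = {}
    # Place the largest tasks first, a common knapsack-style heuristic.
    for task in sorted(tasks, key=lambda t: -t["cpu"]):
        for nd in nodes:
            if nd["cpu"] >= task["cpu"] and nd["mem"] >= task["mem"]:
                nd["cpu"] -= task["cpu"]
                nd["mem"] -= task["mem"]
                placement[task["id"]] = nd["id"]
                break
        else:
            placement[task["id"]] = None
    return placement

tasks = [{"id": "t1", "cpu": 2, "mem": 4},
         {"id": "t2", "cpu": 4, "mem": 8},
         {"id": "t3", "cpu": 1, "mem": 2}]
nodes = [{"id": "n1", "cpu": 4, "mem": 8,  "watts_per_core": 10.0},
         {"id": "n2", "cpu": 8, "mem": 16, "watts_per_core": 15.0}]
print(eaga(tasks, nodes))  # {'t2': 'n1', 't1': 'n2', 't3': 'n2'}
```

Packing work onto the most efficient nodes first is what lets the framework leave the remaining nodes idle long enough to cross the power-off threshold.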
Scalable system scheduling for HPC and big data
(2018)
In the rapidly expanding field of parallel processing, job schedulers are the "operating systems" of modern big data architectures and supercomputing systems. Job schedulers allocate computing resources and control the execution of processes on those resources. Historically, job schedulers were the domain of supercomputers and were designed to run massive, long-running computations over days and weeks. More recently, big data workloads have created a need for a new class of computations consisting of many short computations, taking seconds or minutes, that process enormous quantities of data. For both supercomputers and big data systems, the efficiency of the job scheduler represents a fundamental limit on the efficiency of the system. Detailed measurement and modeling of scheduler performance are critical for maximizing the performance of a large-scale computing system. This paper presents a detailed feature analysis of 15 supercomputing and big data schedulers. For big data workloads, scheduler latency is the most important performance characteristic. A theoretical model of the latency of these schedulers is developed and used to design experiments targeted at measuring scheduler latency. Detailed benchmarking of four of the most popular schedulers (Slurm, Son of Grid Engine, Mesos, and Hadoop YARN) is conducted. The theoretical model is compared with the data and demonstrates that scheduler performance can be characterized by two key parameters: the marginal latency of the scheduler t_s and a nonlinear exponent α_s. For all four schedulers, the utilization of the computing system decreases to <10% for computations lasting only a few seconds. Multi-level schedulers (such as LLMapReduce) that transparently aggregate short computations can improve utilization for these short computations to >90% for all four schedulers tested.
Keywords: Scheduler, Resource manager, Job scheduler, High performance computing, Data analytics
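A hedged reconstruction of the two-parameter model: if launching n jobs costs roughly t_s * n**α_s seconds of scheduler overhead, utilization is useful work divided by work plus overhead. This reproduces the qualitative finding above, utilization collapsing for second-scale jobs and recovering for long ones. The parameter values below are made up for illustration; they are not the paper's fitted values.

```python
def utilization(T, n, t_s=1.0, alpha_s=1.1):
    """Fraction of cluster time spent on useful work when running n jobs
    of length T seconds, assuming scheduler overhead ~ t_s * n**alpha_s.
    The overhead form and parameter values are illustrative assumptions."""
    work = n * T
    overhead = t_s * n ** alpha_s
    return work / (work + overhead)

# Sweep job length for a batch of 1000 jobs: short jobs are dominated by
# scheduler overhead, long jobs amortize it away.
for T in (1, 10, 100, 1000):
    print(f"T={T:>4}s  utilization={utilization(T, n=1000):.3f}")
```

This is also why multi-level schedulers help: aggregating k short jobs into one submission divides n by k, shrinking the superlinear overhead term.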
A course on big data analytics
(2018)
This report details a course on big data analytics designed for undergraduate junior and senior computer science students. The course is heavily focused on projects and on writing code for big data processing. It is designed to help students learn the parallel and distributed computing frameworks and techniques commonly used in industry. The curriculum includes a progression of projects requiring increasingly sophisticated big data processing, ranging from data preprocessing with Linux tools to distributed processing with Hadoop MapReduce and Spark and database queries with Hive and Google's BigQuery. We discuss hardware infrastructure and experimentally evaluate the cost/benefit of an on-premise server versus Amazon's Elastic MapReduce. Finally, we showcase outcomes of our course in terms of student engagement and anonymous student feedback.
Keywords: Curriculum, Undergraduate education, Big data, Cloud computing