Big Data Based Security Analytics for Protecting Virtualized Infrastructures in Cloud Computing (2018)
Virtualized infrastructure in cloud computing has become an attractive target for cyberattackers to launch advanced attacks. This paper proposes a novel big data based security analytics approach to detecting advanced attacks in virtualized infrastructures. Network logs as well as user application logs collected periodically from the guest virtual machines (VMs) are stored in the Hadoop Distributed File System (HDFS). Then, attack features are extracted through graph-based event correlation and MapReduce parser based identification of potential attack paths. Next, the presence of an attack is determined through two-step machine learning: logistic regression is applied to calculate the attack’s conditional probabilities with respect to the attributes, and belief propagation is applied to calculate the belief in the existence of an attack based on them. Experiments are conducted to evaluate the proposed approach using well-known malware and by comparison with existing security techniques for virtualized infrastructure. The results show that our proposed approach is effective in detecting attacks with minimal performance overhead.
Index Terms: Virtualized infrastructure, virtualization security, cloud security, malware detection, rootkit detection, security analytics, event correlation, logistic regression, belief propagation
Achieving high performance and privacy-preserving query over encrypted multidimensional big metering data (2018)
With the proliferation of smart grids, traditional utilities are struggling to handle the increasing amount of metering data. Outsourcing the metering data to heterogeneous distributed systems has the potential to provide efficient data access and processing. In an untrusted heterogeneous distributed system environment, employing data encryption prior to outsourcing can be an effective way to preserve user privacy. However, how to efficiently query encrypted multidimensional metering data stored in an untrusted heterogeneous distributed system environment remains a research challenge. In this paper, we propose a high performance and privacy-preserving query (P2Q) scheme over encrypted multidimensional big metering data to address this challenge. In the proposed scheme, encrypted metering data are stored in the server of an untrusted heterogeneous distributed system environment. A Locality Sensitive Hashing (LSH) based similarity search approach is then used to realize the similarity query. To demonstrate the utility of the proposed LSH-based search approach, we implement a prototype using MapReduce for the Hadoop distributed environment. More specifically, for a given query, the proxy server returns the identifiers of the top K most similar data objects. An enhanced Ciphertext-Policy Attribute-based Encryption (CP-ABE) policy is then used to control access to the search results. Therefore, only a requester with an authorized query attribute can obtain the correct secret keys to retrieve the metering data. We then prove that the P2Q scheme achieves data confidentiality and preserves the data owner’s privacy in a semi-trusted cloud. In addition, our evaluations demonstrate that the P2Q scheme can significantly reduce response time and provide high search efficiency without compromising search quality (i.e., it is suitable for multidimensional big data search in heterogeneous distributed systems, such as cloud storage systems).
Keywords: Smart grid, High performance, Privacy preservation, Similarity query, Multidimensional big metering data
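The LSH-based similarity query mentioned in the abstract above can be illustrated with a minimal sketch. This is not the P2Q scheme: it works on plaintext vectors, omits encryption and CP-ABE access control, and uses random-hyperplane hashing as an assumed LSH family. All function names and the sample meter readings are hypothetical.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (assumed LSH family: sign of dot product)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def signature(vec, planes):
    """Hash a vector to a bit tuple: one bit per hyperplane side."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def build_index(records, planes):
    """Group record identifiers into buckets keyed by their signature."""
    buckets = {}
    for rid, vec in records.items():
        buckets.setdefault(signature(vec, planes), []).append(rid)
    return buckets

def query(vec, records, buckets, planes, k):
    """Return up to k identifiers from the query's bucket, nearest first."""
    candidates = buckets.get(signature(vec, planes), [])

    def squared_distance(rid):
        return sum((a - b) ** 2 for a, b in zip(records[rid], vec))

    return sorted(candidates, key=squared_distance)[:k]

# Hypothetical multidimensional meter readings.
records = {"m1": [1.0, 2.0, 3.0], "m2": [1.1, 2.1, 2.9], "m3": [-3.0, 0.5, -2.0]}
planes = make_hyperplanes(num_bits=4, dim=3)
buckets = build_index(records, planes)
print(query([1.0, 2.0, 3.0], records, buckets, planes, k=2))
```

Only the records that share the query's bucket are compared, which is what keeps the search cheap; the trade-off is that a near neighbor hashed to a different bucket is missed, so practical systems typically use several hash tables.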
FASTEN: An FPGA Based Secure System for Big Data Processing (2018)
In the cloud computing framework, data security and protection are among the most important aspects for optimization and concrete implementation. This paper proposes a reliable yet efficient FPGA-based security system via crypto engines and Physical Unclonable Functions (PUFs) for big data applications. Considering that FPGA or GPU-based accelerators are popular in data centers, we believe the proposed approach is a very practical and effective method for data security in cloud computing.
Keywords: FPGA, Security, Big Data, Cloud Computing, Hadoop MapReduce
A Distributed Fuzzy Associative Classifier for Big Data (2018)
Fuzzy associative classification has not been widely analyzed in the literature, although associative classifiers (ACs) have proved to be very effective in different real domain applications. The main reason is that learning fuzzy ACs is a very heavy task, especially when dealing with large datasets. To overcome this drawback, in this paper, we propose an efficient distributed fuzzy associative classification approach based on the MapReduce paradigm. The approach exploits a novel distributed discretizer based on fuzzy entropy for efficiently generating fuzzy partitions of the attributes. Then, a set of candidate fuzzy association rules is generated by employing a distributed fuzzy extension of the well-known FP-Growth algorithm. Finally, this set is pruned by using three purposely adapted types of pruning. We implemented our approach on the popular Hadoop framework. Hadoop allows distributing storage and processing of very large data sets on computer clusters built from commodity hardware. We performed extensive experiments and a detailed analysis of the results using six very large datasets with up to 11,000,000 instances. We also experimented with different types of reasoning methods. Focusing on accuracy, model complexity, computation time, and scalability, we compare the results achieved by our approach with those obtained by two distributed nonfuzzy ACs recently proposed in the literature. We highlight that, although the accuracies are comparable, the complexity of the classifiers generated by the fuzzy distributed approach, evaluated in terms of number of rules, is lower than that of the nonfuzzy classifiers.
Index Terms: Associative classifier (AC), big data, fuzzy AC (FAC), fuzzy FP-Growth, Hadoop, MapReduce
Apriori Versions Based on MapReduce for Mining Frequent Patterns on Big Data (2018)
Pattern mining is one of the most important tasks to extract meaningful and useful information from raw data. This task aims to extract itemsets that represent any type of homogeneity and regularity in data. Although many efficient algorithms have been developed in this regard, the growing interest in data has caused the performance of existing pattern mining techniques to drop. The goal of this paper is to propose new efficient pattern mining algorithms to work in big data. To this aim, a series of algorithms based on the MapReduce framework and the Hadoop open-source implementation have been proposed. The proposed algorithms can be divided into three main groups. First, two algorithms [Apriori MapReduce (AprioriMR) and iterative AprioriMR] with no pruning strategy are proposed, which extract any existing itemset in data. Second, two algorithms (space pruning AprioriMR and top AprioriMR) that prune the search space by means of the well-known anti-monotone property are proposed. Finally, a last algorithm (maximal AprioriMR) is also proposed for mining condensed representations of frequent patterns. To test the performance of the proposed algorithms, a varied collection of big data datasets has been considered, comprising up to 3·10^18 transactions and more than 5 million distinct single items. The experimental stage includes comparisons against highly efficient and well-known pattern mining algorithms. Results reveal the interest of applying MapReduce versions when complex problems are considered, and also the unsuitability of this paradigm when dealing with small data.
Index Terms: Big data, Hadoop, MapReduce, pattern mining
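As a rough illustration of how itemset counting maps onto the MapReduce model described in the abstract above, the following sketch simulates a map phase and a reduce phase in plain Python. It is not the authors' AprioriMR code: the function names are hypothetical, no Hadoop API is used, and no search-space pruning is applied before counting.

```python
from collections import Counter
from itertools import combinations

def map_phase(transaction, size):
    """Emit (itemset, 1) for every itemset of the given size in one transaction."""
    items = sorted(set(transaction))
    return [(itemset, 1) for itemset in combinations(items, size)]

def reduce_phase(pairs):
    """Sum the emitted counts per itemset, as a reducer would per key."""
    counts = Counter()
    for itemset, one in pairs:
        counts[itemset] += one
    return counts

def frequent_itemsets(transactions, size, min_support):
    """Run map then reduce, keeping itemsets that reach the support threshold."""
    pairs = [pair for t in transactions for pair in map_phase(t, size)]
    counts = reduce_phase(pairs)
    return {s: c for s, c in counts.items() if c >= min_support}

# Hypothetical toy transaction database.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]]
print(frequent_itemsets(transactions, size=2, min_support=2))
```

In a real cluster each mapper would see only its own split of the transactions and the framework would group the pairs by key before reducing; the sketch collapses both steps into one process.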
MIA: Metric Importance Analysis for Big Data Workload Characterization (2018)
Data analytics is at the foundation of both high-quality products and services in modern economies and societies. Big data workloads run on complex large-scale computing clusters, which implies significant challenges for deeply understanding and characterizing overall system performance. In general, performance is affected by many factors at multiple layers in the system stack, hence it is challenging to identify the key metrics when understanding big data workload performance. In this paper, we propose a novel workload characterization methodology using ensemble learning, called Metric Importance Analysis (MIA), to quantify the respective importance of workload metrics. By focusing on the most important metrics, MIA reduces the complexity of the analysis without losing information. Moreover, we develop the MIA-based Kiviat Plot (MKP) and Benchmark Similarity Matrix (BSM) which provide more insightful information than the traditional linkage clustering based dendrogram to visualize program behavior (dis)similarity. To demonstrate the applicability of MIA, we use it to characterize three big data benchmark suites: HiBench, CloudRank-D and SZTS. The results show that MIA is able to characterize complex big data workloads in a simple, intuitive manner, and reveal interesting insights. Moreover, through a case study, we demonstrate that tuning the configuration parameters related to the important metrics found by MIA results in higher performance improvements than through tuning the parameters related to the less important ones.
Index Terms: Big data, benchmarking, workload characterization, performance measurement, MapReduce/Hadoop
Socio-cyber network: The potential of cyber-physical system to define human behaviors using big data analytics (2018)
The growing gap between users and big data analytics requires innovative tools that address the challenges posed by big data volume, variety, and velocity, since it becomes computationally inefficient to analyze such massive volumes of data. Moreover, advancements in the field of big data applications and data science lead toward a new paradigm of human behavior, where various smart devices integrate with each other and establish relationships. However, the majority of systems are either memoryless or computationally inefficient and are unable to define or predict human behavior. Therefore, keeping in view the aforementioned needs, there is a requirement for a system that can efficiently analyze a stream of big data within their requirements. Hence, this paper presents a system architecture that integrates the social network with the technical network. We derive a novel notion of ‘Socio-Cyber Network’, where a friendship is made based on the geo-location information of the user and a trust index based on graph theory is used. The proposed graph theory provides a better understanding of extracting knowledge from the data and finding relationships between different users. To check the efficiency of the algorithms exploited in the proposed system architecture, we have implemented our proposed system using Hadoop and MapReduce. MapReduce for cyber-physical systems (CPS) is supported by a parallel algorithm that efficiently processes a huge volume of data sets. The system is implemented using the Spark GraphX tool on top of the Hadoop parallel nodes to generate and process graphs in near real time. Moreover, the system is evaluated in terms of efficiency by considering the system throughput and processing time. The results show that the proposed system is more scalable and efficient.
Keywords: Big data, Socio-cyber network, Human behavior, Graphs, Friendship, Trust index
A novel process-based association rule approach through maximal frequent itemsets for big data processing (2018)
The maximal frequent itemsets issue in big data processing has become a hot research topic. Most of the previous work on big data processing directly analyzes the data through the existing approaches, which causes problems of redundant computation, high time complexity, and large storage space. To solve these problems, this paper proposes a Heuristic MapReduce-based Association rule approach through Maximal frequent itemsets mining, HMAM. The main idea is as follows. First, by directly operating on the transaction database, we allocate transactions to different processing nodes and group all transactions according to dimension. Then, we screen the most frequent transactions from each transaction set using the Bitmap-Sort and obtain the best-transaction-set by aggregating all the transaction-elects of each transaction set. The current candidate maximal frequent itemsets can be acquired by removing sub-transactions in terms of the inclusion relations of the items in the best-transaction-set. At the same time, each subset of sub-transactions in the candidate maximal frequent itemsets is discarded from all transaction sets. Then the final candidate maximal frequent itemsets can be obtained by iteration until each transaction set is empty. Finally, we obtain the maximal frequent itemsets by applying the minimum support threshold. The experimental results demonstrate that, compared with the existing approaches, HMAM avoids producing the large number of candidate itemsets that result from the join operation, accelerates the mining of maximal frequent itemsets, and improves the utilization rate of resources at the same time.
Keywords: Maximal frequent itemsets, Big data, MapReduce, Frequent transactions, Best-transaction-set
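For readers unfamiliar with the term, a frequent itemset is maximal when no proper superset of it is also frequent. The sketch below shows only that definition, using the inclusion relation the abstract above refers to; it is not the HMAM heuristic, and the function name and sample itemsets are hypothetical.

```python
def maximal_itemsets(frequent):
    """Keep the itemsets that are not a proper subset of another frequent itemset."""
    sets = [frozenset(s) for s in frequent]
    # `s < other` is Python's proper-subset test on (frozen)sets.
    return {s for s in sets if not any(s < other for other in sets)}

# Hypothetical frequent itemsets, assumed to have passed a support threshold already.
frequent = [{"a"}, {"b"}, {"a", "b"}, {"c"}]
print(maximal_itemsets(frequent))
```

The pairwise comparison is quadratic in the number of frequent itemsets, which is acceptable for an illustration but is exactly the kind of cost that distributed approaches try to avoid.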
In-Mapper combiner based MapReduce algorithm for processing of big climate data (2018)
Big data refers to a collection of massive volumes of data that cannot be processed by conventional data processing tools and technologies. In recent years, the number of data production sources has grown noticeably, including high-end streaming devices, wireless sensor networks, satellites, and wearable Internet of Things (IoT) devices. These sources generate a massive volume of data in a continuous manner. A large volume of climate data is collected from IoT weather sensor devices and NCEP. In this paper, a big data processing framework is proposed to integrate climate and health data and to find the correlation between climate parameters and the incidence of dengue. This framework is demonstrated with the help of the MapReduce programming model, Hive, HBase and ArcGIS in a Hadoop Distributed File System (HDFS) environment. The weather parameters minimum temperature, maximum temperature, wind, precipitation, solar and relative humidity are collected for the study area, Tamil Nadu, with the help of IoT weather sensor devices and NCEP. The proposed framework focuses only on climate data for 32 districts of Tamil Nadu, where each district contains 157,680 rows, so there are 5,045,760 rows in total. Batch view precomputation for the monthly mean of the various climate parameters would require all 5,045,760 rows, which would create more latency in query processing. To overcome this issue, batch views can be precomputed for a smaller number of records, with more computation done at query time. The In-Mapper based MapReduce framework is used to compute the monthly mean of each climate parameter for each latitude and longitude. The experimental results show that the response time of the In-Mapper combiner based algorithm is lower than that of the existing MapReduce algorithm.
Keywords: Big data, Internet of Things, Weather sensor devices, MapReduce programming model, Hadoop distributed file system
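The in-mapper combining pattern named in the abstract above can be sketched as follows. This is an illustrative simulation in plain Python, not the authors' implementation: the row layout (latitude, longitude, month, value), the function names and the sample readings are assumptions. The idea is that each mapper aggregates partial sums and counts in memory and emits one pair per key, instead of one pair per input row.

```python
from collections import defaultdict

def mapper_with_in_mapper_combining(rows):
    """Aggregate (sum, count) per key inside the mapper, then emit once per key."""
    partial = defaultdict(lambda: [0.0, 0])
    for lat, lon, month, value in rows:
        acc = partial[(lat, lon, month)]
        acc[0] += value
        acc[1] += 1
    for key, (total, count) in partial.items():
        yield key, (total, count)

def reducer(pairs):
    """Merge partial (sum, count) pairs from all mappers and compute the mean."""
    merged = defaultdict(lambda: [0.0, 0])
    for key, (total, count) in pairs:
        merged[key][0] += total
        merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}

# Two hypothetical input splits with maximum-temperature readings.
split_1 = [(11.0, 78.0, "2015-01", 20.0), (11.0, 78.0, "2015-01", 22.0)]
split_2 = [(11.0, 78.0, "2015-01", 24.0)]
pairs = list(mapper_with_in_mapper_combining(split_1))
pairs += list(mapper_with_in_mapper_combining(split_2))
print(reducer(pairs))
```

Emitting sums and counts rather than per-mapper means matters: averaging the mappers' means would give the wrong result whenever the splits contain different numbers of rows. The row total quoted in the abstract follows from 32 districts × 157,680 rows = 5,045,760 rows.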
TPTVer: A Trusted Third Party Based Trusted Verifier for Multi-Layered Outsourced Big Data System in Cloud Environment (2018)
Cloud computing is very useful for big data owners who do not want to manage IT infrastructure and the technical details of big data. However, it is hard for a big data owner to trust a multi-layer outsourced big data system in a cloud environment and to verify which outsourced service leads to a problem. Similarly, the cloud service provider cannot simply trust the data computation applications. Finally, the verification data itself may also leak sensitive information from the cloud service provider and data owner. We propose a new three-level definition of the verification, a threat model, and corresponding trusted policies based on different roles for outsourced big data systems in the cloud. We also provide two policy enforcement methods for building a trusted data computation environment by measuring both the MapReduce application and its behaviors based on trusted computing and aspect-oriented programming. To prevent sensitive information leakage from the verification process, we provide a privacy-preserving verification method. Finally, we implement TPTVer, a Trusted third Party based Trusted Verifier, as a proof-of-concept system. Our evaluation and analysis show that TPTVer can provide trusted verification for multi-layered outsourced big data systems in the cloud with low overhead.
Keywords: big data security; outsourced service security; MapReduce behavior; trusted verification; trusted third party