On Distributed Fuzzy Decision Trees for Big Data
Distributed fuzzy decision trees for big data - 2018
Fuzzy decision trees (FDTs) have been shown to be an effective solution in the framework of fuzzy classification. The approaches proposed so far to FDT learning, however, have generally neglected time and space requirements. In this paper, we propose a distributed FDT learning scheme shaped according to the MapReduce programming model for generating both binary and multiway FDTs from big data. The scheme relies on a novel distributed fuzzy discretizer that generates a strong fuzzy partition for each continuous attribute based on fuzzy information entropy. The fuzzy partitions are therefore used as input to the FDT learning algorithm, which employs fuzzy information gain for selecting the attributes at the decision nodes. We have implemented the FDT learning scheme on the Apache Spark framework. We have used ten real-world publicly available big datasets for evaluating the behavior of the scheme along three dimensions: 1) performance in terms of classification accuracy, model complexity, and execution time; 2) scalability varying the number of computing units; and 3) ability to efficiently accommodate an increasing dataset size. We have demonstrated that the proposed scheme turns out to be suitable for managing big datasets even with modest commodity hardware support. Finally, we have used the distributed decision tree learning algorithm implemented in the MLlib library and the Chi-FRBCS-BigData algorithm, a MapReduce distributed fuzzy rule-based classification system, for comparative analysis.
Index Terms: Apache Spark, big data, fuzzy decision trees (FDTs), fuzzy discretizer, fuzzy entropy, fuzzy partitioning, MapReduce
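The attribute-selection criterion the abstract names, fuzzy information gain, replaces crisp example counts with sums of fuzzy membership degrees. The serial Python sketch below illustrates only that idea; the paper computes these quantities in a distributed MapReduce fashion, and the function names here are illustrative, not from the paper.

```python
import math

def fuzzy_entropy(memberships, labels):
    """Fuzzy entropy of a node: class frequencies weighted by membership."""
    total = sum(memberships)
    if total == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = sum(m for m, y in zip(memberships, labels) if y == c) / total
        if p > 0:
            h -= p * math.log2(p)
    return h

def fuzzy_information_gain(memberships, labels, child_memberships):
    """Gain of splitting a node by a strong fuzzy partition.

    child_memberships[j][i] is the membership of example i in child fuzzy
    set j, already combined (e.g. by product) with the parent membership.
    """
    parent_card = sum(memberships)
    gain = fuzzy_entropy(memberships, labels)
    for child in child_memberships:
        card = sum(child)
        if card > 0:
            gain -= (card / parent_card) * fuzzy_entropy(child, labels)
    return gain
```

With a crisp (0/1) partition this reduces to classical information gain; with overlapping triangular fuzzy sets each example contributes fractionally to several children, which is what makes strong fuzzy partitions useful at decision nodes.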
CAMP: Accurate Modeling of Core and Memory Locality for Proxy Generation of Big-data Applications
CAMP: Accurate core and memory modeling for proxy generation of big-data applications - 2018
Fast and accurate design-space exploration is a critical requirement for enabling future hardware designs. However, big-data applications are often complex targets to evaluate on early performance models (e.g., simulators or RTL models) owing to their complex software stacks, significantly long run times, system dependencies, and the limited speed of performance models. To overcome the challenges in benchmarking complex big-data applications, in this paper we propose a proxy generation methodology, CAMP, that can generate miniature proxy benchmarks which are representative of the performance of big-data applications and yet converge to results quickly without needing any complex software-stack support. Prior system-level proxy generation techniques model core locality features in detail, but abstract out memory locality modeling using simple stride-based models, which results in poor cloning accuracy for most applications. CAMP accurately models both core performance and memory locality, along with modeling the feedback loop between the two. CAMP replicates core performance by modeling the dependencies between instructions, instruction types, control-flow behavior, etc. CAMP also adds a memory locality profiling approach that captures the spatial and temporal locality of applications. Finally, we propose a novel proxy replay methodology that integrates the core and memory locality models to create accurate system-level proxy benchmarks. We demonstrate that CAMP proxies can mimic the original application's performance behavior and that they can capture the performance feedback loop well. For a variety of real-world big-data applications, we show that CAMP achieves an average cloning accuracy of 89%. We believe this is a new capability that can facilitate overall system (core and memory subsystem) design exploration.
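Temporal locality of the kind such profilers capture is commonly summarized with reuse (LRU stack) distances: for each memory access, the number of distinct addresses touched since the last access to the same address. The sketch below is a generic reuse-distance profiler for illustration only; it is not CAMP's actual profiling format, and the O(n) stack scan per access is naive.

```python
from collections import OrderedDict

def reuse_distance_histogram(trace):
    """Histogram of LRU stack (reuse) distances for an address trace.

    Cold misses (first touch of an address) are binned at infinity.
    A histogram like this summarizes temporal locality that a proxy
    benchmark must reproduce to clone cache behavior.
    """
    stack = OrderedDict()  # addresses in least-recently-used order
    hist = {}
    for addr in trace:
        if addr in stack:
            # distance = distinct addresses touched since last use of addr
            keys = list(stack.keys())
            dist = len(keys) - 1 - keys.index(addr)
            stack.pop(addr)
        else:
            dist = float('inf')  # cold miss
        hist[dist] = hist.get(dist, 0) + 1
        stack[addr] = True       # re-insert as most recently used
    return hist
```

A trace that alternates between two addresses, for example, yields all reuse distances equal to 1 after the two cold misses, which a proxy generator would reproduce with a small working set.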
Big Data Processing Stacks
Big data processing stacks - 2017
The radical expansion and integration of computation, networking, digital devices, and data storage has generated large amounts of data that must be processed, shared, and analyzed. For example, Facebook generates more than 10 petabytes of log data monthly, and Google processes hundreds of petabytes per month. Alibaba generates tens of terabytes in daily online trading transactions. This collected information is growing, and the explosive increase of global data in all 3Vs (volume, velocity, and variety) has been termed big data. According to IBM, we are currently creating 2.5 quintillion bytes of data every day (https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html). IDC predicts that the worldwide volume of data will reach 40 zettabytes by 2020, 85 percent of which will be unstructured data in new types and formats, including server logs and other machine-generated data, data from sensors, social media data, and many other sources (www.emc.com/about/news/press/2012/20121211-01.htm). In practice, these conditions represent a new scale of big data that has been attracting a lot of interest from both the research and industrial communities, which hope to create the best means of processing, analyzing, and using this data.
Wireless Big-Data: Opportunities and Design Challenges
Wireless big data: opportunities and design challenges - 2016
Recent technological advancements have led to a deluge of data from distinctive domains over the past several years. In addition to its sheer volume, big data also exhibits other unique characteristics compared with traditional data. For instance, more and more data centers host various 'big data', and these centers need to be networked to support cloud computing. This development calls for new system architectures for data acquisition and transmission. In this paper, we present a definition of and challenges for distributed big data. Finally, we outline several potential research directions for distributed big data systems.
Index Terms: wireless | distributed big data | THz
A First Approach in Evolutionary Fuzzy Systems based on the lateral tuning of the linguistic labels for Big Data classification
A first approach in evolutionary fuzzy systems based on lateral tuning of the linguistic labels for big data classification - 2016
The treatment and processing of Big Data problems imply an essential advantage for researchers and corporations, owing to the huge quantity of knowledge hidden within the vast amount of information available nowadays. In order to address such a volume of information efficiently, scalability for Big Data applications is achieved by means of the MapReduce programming model. It is designed to divide the data into several chunks or groups that are processed in parallel, and whose results are "assembled" to provide a single solution. Focusing on classification tasks, Fuzzy Rule Based Classification Systems have shown interesting results with a MapReduce approach for Big Data. It is well known that the behaviour of these types of systems can be further improved in synergy with Evolutionary Algorithms, leading to Evolutionary Fuzzy Systems. However, to the best of our knowledge, there are no developments in this field yet. In this work, we propose a first Evolutionary Fuzzy System for Big Data problems. It consists of an initial Knowledge Base built by means of the Chi-FRBCS-BigData algorithm, followed by a genetic tuning of the Data Base by means of the 2-tuples representation. This way, the fuzzy labels are better contextualized within every subset of the problem, and the coverage of the Rule Base is enhanced. Then, the Knowledge Bases from each Map process are joined to build an ensemble classifier. Experimental results show the improvement achieved by this model with respect to the standard Chi-FRBCS-BigData approach, and open the way for promising future work on the topic.
Keywords: Big data | Data models | Programming | Training | Fuzzy systems | Pragmatics | Tuning
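The 2-tuples representation mentioned in the abstract pairs each linguistic label with a lateral displacement alpha in [-0.5, 0.5), so a label's fuzzy set can slide along the domain without changing shape. The sketch below shows only the displaced-membership evaluation; the paper tunes alpha with a genetic algorithm inside each Map process, and the `width` parameter (distance between adjacent label cores) is an assumption of this sketch.

```python
def triangular(x, a, b, c):
    """Standard triangular membership function with core at b."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def tuned_membership(x, a, b, c, alpha, width):
    """Membership of x in a label laterally displaced by alpha.

    Shifting all three defining points by alpha * width moves the label
    along the domain while preserving its shape, which is the essence of
    2-tuples lateral tuning.
    """
    shift = alpha * width
    return triangular(x, a + shift, b + shift, c + shift)
```

With alpha = 0 the label is unchanged; a small positive alpha moves the label's core to the right, so points just above the original core gain full membership while points at the original core lose some. The genetic search exploits exactly this freedom to better fit each data chunk.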
Knowledge Discovery for Smart Grid Operation, Control, and Situation Awareness: A Big Data Visualization Platform
Knowledge discovery for smart grid operation, control, and situation awareness: a big data visualization platform - 2016
In this paper, a big data visualization platform is designed to discover the hidden useful knowledge for smart grid (SG) operation, control, and situation awareness. The proliferation of smart sensors at both the grid side and the customer side can provide a large volume of heterogeneous data that collects information across all time spectrums. Extracting useful knowledge from this big-data pool is still challenging. In this paper, Apache Spark, an open-source cluster computing framework, is used to process the big data to effectively discover the hidden knowledge. A high-speed communication architecture utilizing the Open System Interconnection (OSI) model is designed to transmit the data to a visualization platform. This visualization platform uses Google Earth, a global geographic information system (GIS), to link the geological information with the SG knowledge and visualize the information in a user-defined fashion. The University of Denver's campus grid is used as an SG test bench, and several demonstrations are presented for the proposed platform.
Index terms: Big data | knowledge discovery | smart sensor | Apache Spark | geographic information system | parallel computation
An Adaptive Information-Theoretic Approach for Identifying Temporal Correlations in Big Data Sets
An adaptive information-theoretic approach for identifying temporal correlations in big data sets - 2016
In the past two decades, new developments in computing, sensing, and crowdsourced data have resulted in an explosion in the availability of quantitative information. The possibilities of analyzing this so-called "big data" to inform research and the decision-making process are virtually endless. In general, analyses have to be done across multiple data sets in order to bring out the most value of big data. A first important step is to identify temporal correlations between data sets. Given the characteristics of big data in terms of volume and velocity, techniques that identify correlations not only need to be scalable, but also need to help users order the correlations across temporal resolutions so that they can focus on important relationships. There is a large body of work in this area; however, most of it either deals only with small data sets, uses a fixed temporal resolution, or does not provide a quantifiable measure of correlation significance. In this paper, we present a method based on mutual information to identify correlations in large data sets. Discovered correlations are suggested to users in an order based on their significance. Our method supports an adaptive streaming technique that minimizes duplicated computation and is implemented on top of Apache Spark for scalability on big data platforms. We also provide a comprehensive evaluation using real-world data sets from NYC Open Data, and compare our findings against a recent study.
Keywords: temporal correlation | mutual information | BigData | adaptive sliding window | streaming
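The core quantity the abstract relies on, mutual information between two discretized series evaluated over sliding windows, can be sketched in a few lines. This is a naive serial illustration with assumed function names: the paper's method additionally ranks correlations by significance, adapts the window, and reuses partial counts between overlapping windows on Spark, none of which is shown here.

```python
import math
from collections import Counter

def mutual_information(xs, ys, bins=4):
    """Mutual information (bits) between two equal-length series,
    after equal-width discretization into `bins` levels."""
    def discretize(vs):
        lo, hi = min(vs), max(vs)
        if hi == lo:
            return [0] * len(vs)
        return [min(int((v - lo) / (hi - lo) * bins), bins - 1) for v in vs]
    dx, dy = discretize(xs), discretize(ys)
    n = len(xs)
    px, py = Counter(dx), Counter(dy)
    pxy = Counter(zip(dx, dy))
    mi = 0.0
    for (i, j), c in pxy.items():
        p = c / n                       # joint probability of cell (i, j)
        mi += p * math.log2(p * n * n / (px[i] * py[j]))
    return mi

def windowed_mi(xs, ys, window, step):
    """MI for each sliding window; recomputes every window from scratch,
    unlike the paper's adaptive streaming technique."""
    return [mutual_information(xs[s:s + window], ys[s:s + window])
            for s in range(0, len(xs) - window + 1, step)]
```

Identical series yield MI equal to the entropy of the discretized series, while an independent constant series yields zero, which is what makes MI usable as a ranked correlation score across window positions.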
Spatial-Crowd: A Big Data Framework for Efficient Data Visualization
Spatial-Crowd: A Big Data Framework for Efficient Data Visualization-2016
Analyzing and visualizing large datasets generated by real-time spatio-temporal activities (e.g. vehicle mobility or large crowd movement) is a very challenging task. Recursive delays both at the middleware and at front-end applications limit the usefulness of real-time analysis. In this paper, we present a framework, "Spatial-Crowd", that first handles spatio-temporal data acquisition and processing by scaling up the middleware components and its infrastructure. Then, it enables filtering, fixing, enriching, and summarising the acquired dataset, making it readily available for client interfaces, which usually are not scalable or built to manage such large datasets. This framework follows a publish/subscribe model and allows users to subscribe to aggregated data streams instead of requesting data in real time. The framework is tested with a very large simulated dataset, and performance showed a significant data reduction on the client side, enhancing data visualisation.
Keywords: Big data | Data mining | Visualization | Mobility
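The framework's key idea, letting clients subscribe to server-side aggregates rather than pulling raw spatio-temporal data, can be sketched with a tiny in-process broker. All class and method names below are illustrative assumptions, not the paper's API; a real deployment would use a distributed message broker and richer spatial aggregates.

```python
from collections import defaultdict

class AggregatingBroker:
    """Minimal publish/subscribe sketch: raw events are buffered per
    topic, reduced to one small summary per flush interval, and only the
    summary is delivered to subscribers, which is what reduces the data
    volume reaching client-side visualizations."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> callbacks
        self.buffers = defaultdict(list)      # topic -> pending values

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, value):
        self.buffers[topic].append(value)

    def flush(self, topic):
        """Emit one aggregate (count and mean) for the buffered values."""
        values = self.buffers.pop(topic, [])
        if not values:
            return
        summary = {"count": len(values), "mean": sum(values) / len(values)}
        for cb in self.subscribers[topic]:
            cb(summary)
```

A client subscribed to a topic such as a per-zone speed stream then receives one summary per flush instead of every raw reading, trading temporal resolution for client-side load.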
Evaluation of the document-oriented NoSQL data model
Publication year: 2016 - English PDF: 6 pages - Persian translation (DOC): 24 pages
Exabytes of data are generated daily from user-produced information on the Internet. Social networks, mobile devices, email, blogs, video, bank transactions, and more have established new digital channels between brands and their audiences. Powerful tools for storing and exploring this ever-growing data are needed in order to deliver convenient and reliable processing of user information. Traditional models have shown their limitations in facing this challenge, as information keeps growing in both volume and variety; it can only be handled by non-relational data models. Over the past ten years, the document-oriented data model has shaken the standard relational data model. In this paper, we evaluate this model using predefined criteria.
Keywords: data modeling | big data | NoSQL | document-oriented