Ignis: An efficient and scalable multi-language Big Data framework
(2020)
Most of the relevant Big Data processing frameworks (e.g., Apache Hadoop, Apache Spark) only support JVM (Java Virtual Machine) languages by default. To support non-JVM languages, subprocesses are created and connected to the framework using system pipes. With this technique, data can no longer be managed at the thread level, and there is a significant loss of performance. To address this problem we introduce Ignis, a new Big Data framework that provides an elegant way to create multi-language executors managed through an RPC system. As a consequence, the new system is able to natively execute applications implemented in non-JVM languages. In addition, Ignis allows users to combine, in the same application, the benefits of implementing each computational task in the best-suited programming language without additional overhead. The system runs entirely inside Docker containers, isolating the execution environment from the physical machine. A comparison with Apache Spark shows the advantages of our proposal in terms of performance and scalability.
Keywords: Big data | Multi-language | Performance | Scalability | Container
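The executor model described in the abstract, workers managed over RPC rather than through system pipes, can be illustrated with a minimal stand-in built on Python's stdlib `xmlrpc`. The `Executor` class and `map_partition` method below are illustrative names, not Ignis's actual API:

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

class Executor:
    """Stand-in for a non-JVM executor exposing operations over RPC."""
    def map_partition(self, data):
        return [x * x for x in data]

# The driver talks to the executor through structured RPC calls instead of
# raw pipes, so task control and data movement share one typed channel.
server = SimpleXMLRPCServer(("127.0.0.1", 8421), logRequests=False)
server.register_instance(Executor())
threading.Thread(target=server.serve_forever, daemon=True).start()

driver = ServerProxy("http://127.0.0.1:8421")
assert driver.map_partition([1, 2, 3]) == [1, 4, 9]
```

The contrast with the pipe approach is that the RPC interface is self-describing: adding a new operation means registering a new method, not redefining a byte-stream protocol.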
Heterogeneous tree structure classification to label Java programmers according to their expertise level
(2020)
Open-source code repositories are a valuable asset for creating different kinds of tools and services that utilize machine learning and probabilistic reasoning. Syntactic models process Abstract Syntax Trees (ASTs) of source code to build systems capable of predicting different software properties. The main difficulty of building such models comes from the heterogeneous and compound structures of ASTs, and from the fact that traditional machine learning algorithms require instances to be represented as n-dimensional vectors rather than trees. In this article, we propose a new approach to classify ASTs using traditional supervised-learning algorithms, where a feature learning process selects the most representative syntax patterns for the child subtrees of different syntax constructs. Those syntax patterns are used to enrich the context information of each AST, allowing the classification of compound heterogeneous tree structures. The proposed approach is applied to the problem of labeling the expertise level of Java programmers. The system is able to label expert and novice programs with an average accuracy of 99.6%. Moreover, other code fragments such as types, fields, methods, statements and expressions can also be classified, with average accuracies of 99.5%, 91.4%, 95.2%, 88.3% and 78.1%, respectively.
Keywords: Big code | Machine learning | Syntax patterns | Abstract syntax trees | Programmer expertise | Decision trees | Big data
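As a toy illustration of the syntax-pattern idea, features derived from parent/child relations in an AST, the sketch below counts node-type pairs in Python ASTs using the stdlib `ast` module. The paper works on Java ASTs with a learned pattern-selection step; this only conveys the flavor of the feature extraction:

```python
import ast
from collections import Counter

def syntax_patterns(code: str) -> Counter:
    """Count parent->child node-type pairs in the AST, a crude stand-in
    for the paper's learned syntax patterns."""
    tree = ast.parse(code)
    patterns = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            patterns[(type(parent).__name__, type(child).__name__)] += 1
    return patterns

novice = syntax_patterns("x = 0\nfor i in range(10):\n    x = x + i\nprint(x)")
expert = syntax_patterns("total = sum(range(10))\nprint(total)")

# The pattern vectors differ: e.g. only the novice sample contains a For loop.
assert ("Module", "For") in novice
assert ("Module", "For") not in expert
```

Once each tree is a fixed vocabulary of pattern counts, any conventional vector-based classifier (decision trees in the paper) can be trained on it.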
High-performance spatiotemporal trajectory matching across heterogeneous data sources
(2020)
In the era of big data, the movement of the same object or person can be recorded by different devices with different measurement accuracies and sampling rates. Matching and conflating these heterogeneous trajectories help to enhance trajectory semantics, describe user portraits, and discover specified groups from human mobility. In this paper, we propose a high-performance approach for matching spatiotemporal trajectories across heterogeneous massive datasets. Two indicators, i.e., Time Weighted Similarity (TWS) and Space Weighted Similarity (SWS), are proposed to measure the similarity of spatiotemporal trajectories. The core idea is that two trajectories are more similar the longer they stay close in time and space. A distributed computing framework based on Spark is built for efficient trajectory matching among massive datasets. In the framework, the trajectory segments are partitioned into 3-dimensional space–time cells for parallel processing, and a novel method of segment reference points is designed to avoid duplicated computation. We conducted extensive matching experiments on real-world and synthetic trajectory datasets. The experimental results illustrate that the proposed approach outperforms other similarity metrics in accuracy, and the Spark-based framework greatly improves the efficiency of spatiotemporal trajectory matching.
Keywords: Distributed computing | Spatiotemporal big data | Trajectory similarity | Trajectory matching
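The core idea behind TWS/SWS, that similarity grows the longer two trajectories stay close, can be sketched as follows. The exponential weighting and the `d0` distance scale are illustrative choices, not the paper's exact formulas:

```python
import math

def time_weighted_similarity(traj_a, traj_b, d0=50.0):
    """Sketch of a time-weighted similarity: points sampled at the same
    times contribute exp(-d/d0), weighted by the sampling interval.
    traj_a/traj_b: lists of (t, x, y) sampled at common timestamps."""
    score, total_w = 0.0, 0.0
    for (t1, x1, y1), (t2, x2, y2) in zip(traj_a, traj_b):
        assert t1 == t2, "sketch assumes aligned timestamps"
        w = 1.0  # uniform sampling interval
        d = math.hypot(x1 - x2, y1 - y2)
        score += w * math.exp(-d / d0)
        total_w += w
    return score / total_w if total_w else 0.0

a = [(0, 0, 0), (1, 10, 0), (2, 20, 0)]
b = [(0, 0, 5), (1, 10, 5), (2, 20, 5)]
sim = time_weighted_similarity(a, b)
assert 0.9 < sim <= 1.0  # trajectories stay 5 m apart: highly similar
```

In the distributed version, pairs are only compared when their segments fall in the same 3-D space–time cell, which keeps the pairwise work local to each partition.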
Accelerated iterative image reconstruction for cone-beam computed tomography through Big Data frameworks
(2020)
One of the latest trends in Computed Tomography (CT) is the reduction of the radiation dose delivered to patients through a decrease in the amount of acquired data. This reduction results in artifacts in the final images if conventional reconstruction methods are used, making it advisable to employ iterative algorithms to enhance image quality. Most approaches are built around two main operators, backprojection and projection, which are computationally expensive. In this work, we present an implementation of those operators for iterative reconstruction methods that exploits the Big Data paradigm. We define an architecture based on Apache Spark that supports both Graphics Processing Unit (GPU) and CPU-based architectures. The operators are parallelized using a partitioning scheme based on the division of the volume and irregular data structures, in order to reduce the cost of communication and of computing the final images. Our solution accelerates the execution of the two most computationally expensive components with Apache Spark, improving the programming experience for new iterative reconstruction algorithms and the maintainability of the source code by raising the level of abstraction for programmers inexperienced in high-performance computing. Through an experimental evaluation, we show that results can be obtained up to 10× faster for projection and 21× faster for backprojection when using a GPU-based cluster compared to a traditional multi-core version. Although a linear speed-up was not reached, the proposed approach can be a good alternative for porting existing medical image reconstruction applications already implemented in C/C++ or even with the CUDA or OpenCL programming models. Our solution enables the automatic detection of GPU devices and the execution of CPU and GPU tasks at the same time under the same system, using all the available resources.
Keywords: Apache Spark | GPU | Medical image processing | Computed tomography | Iterative reconstruction algorithms
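The volume-division scheme can be sketched as splitting slice indices into contiguous slabs that Spark would then map over in parallel. The function names and the per-slab kernel below are placeholders, not the paper's implementation:

```python
def partition_volume(n_slices, n_partitions):
    """Split slice indices [0, n_slices) into contiguous slabs, mirroring
    the volume-division idea described above (illustrative only)."""
    base, extra = divmod(n_slices, n_partitions)
    slabs, start = [], 0
    for p in range(n_partitions):
        size = base + (1 if p < extra else 0)
        slabs.append(range(start, start + size))
        start += size
    return slabs

def backproject_slab(slab):
    # Placeholder per-partition kernel: in the real system this would run
    # the backprojection for its slab on a CPU core or a GPU device.
    return [f"slice-{i}" for i in slab]

slabs = partition_volume(10, 3)
assert [len(s) for s in slabs] == [4, 3, 3]
results = [backproject_slab(s) for s in slabs]  # Spark maps this in parallel
assert results[0][0] == "slice-0"
```

Keeping slabs contiguous means each partition only needs its own sub-volume plus the projections that intersect it, which is what limits the communication cost.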
Distributed learning on 20 000+ lung cancer patients – The Personal Health Train
(2020)
Background and purpose: Access to healthcare data is indispensable for scientific progress and innovation. Sharing healthcare data is time-consuming and notoriously difficult due to privacy and regulatory concerns. The Personal Health Train (PHT) provides a privacy-by-design infrastructure connecting FAIR (Findable, Accessible, Interoperable, Reusable) data sources and allows distributed data analysis and machine learning. Patient data never leaves a healthcare institute. Materials and methods: Lung cancer patient-specific databases (tumor staging and post-treatment survival information) of oncology departments were translated according to a FAIR data model and stored locally in a graph database. Software was installed locally to enable deployment of distributed machine learning algorithms via a central server. Algorithms (MATLAB, code and documentation publicly available) are patient privacy-preserving as only summary statistics and regression coefficients are exchanged with the central server. A logistic regression model to predict post-treatment two-year survival was trained and evaluated by receiver operating characteristic curves (ROC), root mean square prediction error (RMSE) and calibration plots. Results: In 4 months, we connected databases with 23 203 patient cases across 8 healthcare institutes in 5 countries (Amsterdam, Cardiff, Maastricht, Manchester, Nijmegen, Rome, Rotterdam, Shanghai) using the PHT. Summary statistics were computed across databases. A distributed logistic regression model predicting post-treatment two-year survival was trained on 14 810 patients treated between 1978 and 2011 and validated on 8 393 patients treated between 2012 and 2015. Conclusion: The PHT infrastructure demonstrably overcomes patient privacy barriers to healthcare data sharing and enables fast data analyses across multiple institutes from different countries with different regulatory regimes.
This infrastructure promotes global evidence-based medicine while prioritizing patient privacy.
Keywords: Lung cancer | Big data | Distributed learning | Federated learning | Machine learning | Survival analysis | Prediction modeling | FAIR data
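A minimal sketch of the privacy-preserving training loop: each institute computes a gradient on its local patients, and only that aggregate crosses the wire. It assumes plain logistic regression on toy two-feature data; it is an illustration of the pattern, not the PHT codebase:

```python
import math

def local_gradient(weights, X, y):
    """Gradient of the logistic loss on one institute's local data.
    Only this aggregate leaves the site, never the patient records."""
    g = [0.0] * len(weights)
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(weights, xi))))
        for j, xj in enumerate(xi):
            g[j] += (p - yi) * xj
    return g, len(X)

def federated_step(weights, sites, lr=0.1):
    """Central server: sum per-site gradients, then update the shared model."""
    total_g, n = [0.0] * len(weights), 0
    for X, y in sites:
        g, m = local_gradient(weights, X, y)
        total_g = [a + b for a, b in zip(total_g, g)]
        n += m
    return [w - lr * g / n for w, g in zip(weights, total_g)]

# Two toy "institutes"; features = (bias, stage), label = two-year survival.
site1 = ([[1, 0], [1, 1]], [1, 0])
site2 = ([[1, 0], [1, 1]], [1, 0])
w = [0.0, 0.0]
for _ in range(200):
    w = federated_step(w, [site1, site2])
assert w[1] < 0  # higher stage lowers predicted survival
```

The aggregated gradient is mathematically identical to the gradient on the pooled data, which is why the distributed model matches a centrally trained one without any record ever leaving a hospital.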
Programming languages for data-intensive HPC applications: A systematic mapping study
(2020)
A major challenge in modelling and simulation is the need to combine expertise in both software technologies and a given scientific domain. When High-Performance Computing (HPC) is required to solve a scientific problem, software development becomes a problematic issue. Considering the complexity of the software for HPC, it is useful to identify programming languages that can be used to alleviate this issue. Because the existing literature on the topic of HPC is very dispersed, we performed a Systematic Mapping Study (SMS) in the context of the European COST Action cHiPSet. This literature study maps characteristics of various programming languages for data-intensive HPC applications, including category, typical user profiles, effectiveness, and type of articles. We organised the SMS in two phases. In the first phase, relevant articles were identified employing an automated keyword-based search in eight digital libraries. This led to an initial sample of 420 papers, which was then narrowed down in a second phase by human inspection of article abstracts, titles and keywords to 152 relevant articles published in the period 2006–2018. The analysis of these articles enabled us to identify 26 programming languages referred to in 33 of the relevant articles. We compared the outcome of the mapping study with results of our questionnaire-based survey that involved 57 HPC experts. The mapping study and the survey revealed that the desired features of programming languages for data-intensive HPC applications are portability, performance and usability. Furthermore, we observed that the majority of the programming languages used in the context of data-intensive HPC applications are text-based general-purpose programming languages. Typically these have a steep learning curve, which makes them difficult to adopt. We believe that the outcome of this study will inspire future research and development in programming languages for data-intensive HPC applications.
Keywords: High performance computing (HPC) | Big data | Data-intensive applications | Programming languages | Domain-Specific language (DSL) | General-Purpose language (GPL) | Systematic mapping study (SMS)
Forecasting tourist arrivals via machine learning and internet search index
(2019)
Previous studies have shown that online data, such as queries submitted to search engines, constitute a new source of information that can be used to forecast tourism demand. In this study, we present a forecasting framework that uses machine learning and internet search indexes to predict tourist arrivals at popular destinations in China, and we compare its forecasting performance against the search results generated by Google and Baidu, respectively. The research confirms the Granger causality and correlation between the internet search index and tourist arrivals in Beijing. Our empirical results show that the proposed kernel extreme learning machine (KELM) models, which merge tourist arrival series with the Baidu index and the Google index, significantly outperform the benchmark models in terms of forecasting accuracy and analytical robustness.
Keywords: Tourism demand forecasting | Kernel extreme learning machine | Search query data | Big data analytics | Composite search index
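A kernel extreme learning machine reduces training to a single regularized linear solve, β = (K + I/C)⁻¹y over the kernel matrix K. The sketch below implements that closed form with an RBF kernel on toy data; the kernel choice and the γ and C values are illustrative, not the paper's settings:

```python
import math

def rbf(u, v, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def solve(A, b):
    """Tiny Gauss-Jordan elimination with partial pivoting (stdlib only)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b_ for a, b_ in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def kelm_fit(X, y, C=100.0):
    """KELM training: solve (K + I/C) beta = y for the output weights."""
    n = len(X)
    K = [[rbf(X[i], X[j]) + (1.0 / C if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(K, y)

def kelm_predict(X, beta, x):
    return sum(b * rbf(xi, x) for b, xi in zip(beta, X))

# Toy series: monthly arrivals as a function of a search-index value.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [1.2, 1.9, 3.1, 4.0]
beta = kelm_fit(X, y)
pred = kelm_predict(X, beta, [2.5])
assert 1.5 < pred < 3.5  # interpolates between neighbouring points
```

Because training is one linear solve rather than iterative back-propagation, KELM is fast to retrain as new monthly index values arrive, which suits a forecasting setting.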
Big Data Creates New Opportunities for Materials Research: A Review on Methods and Applications of Machine Learning for Materials Design
(2019)
Materials development has historically been driven by human needs and desires, and this is likely to continue in the foreseeable future. The global population is expected to reach ten billion by 2050, which will promote increasingly large demands for clean and high-efficiency energy, personalized consumer products, secure food supplies, and professional healthcare. New functional materials that are made and tailored for targeted properties or behaviors will be the key to tackling this challenge. Traditionally, advanced materials are found empirically or through experimental trial-and-error approaches. As big data generated by modern experimental and computational techniques is becoming more readily available, data-driven or machine learning (ML) methods have opened new paradigms for the discovery and rational design of materials. In this review article, we provide a brief introduction on various ML methods and related software or tools. Main ideas and basic procedures for employing ML approaches in materials research are highlighted. We then summarize recent important applications of ML for the large-scale screening and optimal design of polymer and porous materials, catalytic materials, and energetic materials. Finally, concluding remarks and an outlook are provided.
Keywords: Big data | Data-driven | Machine learning | Materials screening | Materials design
Discovering unusual structures from exception using big data and machine learning techniques
(2019)
Recently, machine learning (ML) has become a widely used technique in materials science. Most work focuses on predicting rules and overall trends by building a machine learning model. However, new insights are often learnt from exceptions to the overall trend. In this work, we demonstrate how unusual structures are discovered from exceptions when machine learning is used to relate atomic and electronic structures, based on big data from a high-throughput calculation database. For example, after training an ML model for the relationship between the atomic and electronic structures of crystals, we found AgO2F, an unusual structure containing both Ag³⁺ and O₂²⁻, among the structures whose band gap deviates most from the prediction made by our model. A further investigation of this structure might shed light on the research on anionic redox in transition metal oxides for Li-ion batteries.
Keywords: Machine learning | Gradient boosting decision tree | Band gap | Unusual structures
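The exception-mining workflow, fit a model for the overall trend and then inspect the points that deviate most, can be sketched with a deliberately simple baseline standing in for the paper's gradient-boosting model. All data and names below are toy values:

```python
def fit_mean_by_group(samples):
    """Baseline 'model': predict the mean band gap of a composition family
    (a stand-in for the paper's gradient-boosting regressor)."""
    sums, counts = {}, {}
    for family, gap in samples:
        sums[family] = sums.get(family, 0.0) + gap
        counts[family] = counts.get(family, 0) + 1
    return {f: sums[f] / counts[f] for f in sums}

def find_exceptions(samples, model, threshold=1.0):
    """Flag structures whose band gap deviates strongly from the prediction;
    these outliers, not the trend, are where unusual chemistry may hide."""
    return [(f, gap) for f, gap in samples
            if abs(gap - model.get(f, 0.0)) > threshold]

data = [("oxide", 3.0), ("oxide", 3.2), ("oxide", 2.9),
        ("oxide", 0.1),          # anomalous: metallic-like oxide
        ("halide", 5.0), ("halide", 5.2)]
model = fit_mean_by_group(data)
outliers = find_exceptions(data, model, threshold=1.0)
assert outliers == [("oxide", 0.1)]
```

The key design point is that the model's errors are treated as a ranking signal for human follow-up rather than as noise to be minimized away.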
Time-series data augmentation and deep learning for construction equipment activity recognition
(2019)
Automated, real-time, and reliable equipment activity recognition on construction sites can help to minimize idle time, improve operational efficiency, and reduce emissions. Previous efforts in activity recognition of construction equipment have explored different classification algorithms and sensors such as accelerometers and gyroscopes. These studies utilized pattern recognition approaches such as statistical models (e.g., hidden Markov models), shallow neural networks (e.g., Artificial Neural Networks), and distance algorithms (e.g., K-nearest neighbor) to classify the time-series data collected from sensors mounted on the equipment. Such methods necessitate the segmentation of continuous operational data with fixed or dynamic windows to extract statistical features. This heuristic and manual feature extraction process is limited by human knowledge and can only extract human-specified shallow features. Recent developments in deep neural networks, specifically recurrent neural networks (RNNs), present new opportunities to classify sequential time-series data with recurrent lateral connections. An RNN can automatically learn high-level representative features through the network instead of relying on manually designed features, making it more suitable for complex activity recognition. However, applying an RNN requires a large training dataset, which is practically challenging to obtain from real construction sites. Thus, this study presents a data-augmentation framework for generating synthetic time-series training data for an RNN-based deep learning network to accurately and reliably recognize equipment activities. The proposed methodology is validated by generating synthetic data from sample datasets that were collected from two real-world earthmoving operations. The synthetic data, along with the collected data, were used to train a long short-term memory (LSTM)-based RNN.
The trained model was evaluated by comparing its performance with classification algorithms traditionally used for construction equipment activity recognition. The deep learning framework presented in this study outperformed these traditional machine learning classification algorithms in terms of model accuracy and generalization.
Keywords: Construction equipment activity recognition | Inertial measurement unit | Deep learning | Time-series data augmentation | LSTM network | Big data analytics
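Two common time-series augmentations, jittering (additive noise) and magnitude scaling, give the flavor of how synthetic IMU windows can be generated. The paper's framework is richer than this, and all parameters below are illustrative:

```python
import random

def augment(signal, n_copies=3, jitter_sd=0.05, scale_sd=0.1, seed=42):
    """Generate synthetic copies of a sensor window by random magnitude
    scaling plus additive jitter, two standard time-series augmentations."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        scale = rng.gauss(1.0, scale_sd)   # one gain factor per copy
        copies.append([scale * x + rng.gauss(0.0, jitter_sd) for x in signal])
    return copies

accel_z = [0.0, 0.3, 0.9, 0.4, 0.0]   # one window of accelerometer data
synthetic = augment(accel_z)
assert len(synthetic) == 3 and all(len(s) == len(accel_z) for s in synthetic)
assert synthetic[0] != accel_z        # augmented copies differ from the original
```

The augmented windows keep the activity's temporal shape (and hence its label) while varying amplitude and noise, which is what lets a data-hungry LSTM train on a small field-collected dataset.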