Big data analytics for financial market volatility forecast based on support vector machine
Big data analytics for forecasting financial market volatility based on the support vector machine - 2020
High-frequency data provides a wealth of material and broad research prospects for in-depth study of financial market behavior, but the problems solved so far in high-frequency data research are far fewer than those still faced, and the research value of high-frequency data will be greatly reduced unless these problems are solved. Volatility is an important measure of market risk, and research on and forecasting of the volatility of high-frequency data is of great significance to investors, government regulators and capital markets. To this end, the jump volatility of high-frequency data is modelled in order to predict the short-term volatility of high-frequency data.
Keywords: Big data | Financial market | Volatility | Support vector machine
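The abstract above does not say how jump volatility is modelled. As a minimal sketch only, one common way to separate a jump component in high-frequency returns is to compare realized variance with bipower variation; this is an assumption for illustration, not the paper's confirmed method. The price series and all function names below are made up.

```python
import math


def log_returns(prices):
    """Log returns between consecutive intraday prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def realized_variance(returns):
    """Sum of squared intraday returns."""
    return sum(r * r for r in returns)


def bipower_variation(returns):
    """(pi/2) * sum of products of adjacent absolute returns.

    Products of neighbouring returns are much less sensitive to a single
    large jump than squared returns are.
    """
    return (math.pi / 2.0) * sum(
        abs(a) * abs(b) for a, b in zip(returns, returns[1:])
    )


def jump_component(returns):
    """Non-negative part of realized variance minus bipower variation."""
    return max(realized_variance(returns) - bipower_variation(returns), 0.0)


# Made-up intraday prices with one abrupt move (100.2 -> 103.0).
prices = [100.0, 100.1, 100.05, 100.2, 103.0, 103.1, 103.05]
returns = log_returns(prices)
print("realized variance:", realized_variance(returns))
print("bipower variation:", bipower_variation(returns))
print("jump component:   ", jump_component(returns))
```

The resulting daily series (realized variance, continuous part, jump part) could then serve as input features for a regression model such as support vector regression.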
Prediction of the ground temperature with ANN, LS-SVM and fuzzy LS-SVM for GSHP application
Prediction of ground temperature with neural networks, LS-SVM and fuzzy LS-SVM for GSHP use - 2020
Ground source heat pump (GSHP) systems have received increasing attention for their energy-saving and environmentally friendly properties. Knowing the undisturbed ground temperature is a prerequisite for designing a GSHP system. The conventional way to obtain ground temperature data is to bury temperature sensors underground. However, this approach is usually time consuming and costly, and it often runs into technical difficulties. The rapid development of intelligent computation algorithms provides solutions for many difficult practical problems. Based on a large number of ground temperature measurements from two boreholes of 100 m depth located in Chongqing, ground temperature prediction models based on an artificial neural network (ANN) and a least squares support vector machine (LS-SVM) are established. Two kinds of validation, holdout validation and k-fold validation, are then applied to each of the two models. Furthermore, a new method that combines fuzzy theory with LS-SVM is proposed to reduce the heavy computational burden of the LS-SVM model. Comparison with the two models above shows that the newly proposed model not only clearly improves the calculation speed but also improves the prediction accuracy, and is notably superior to the single LS-SVM model.
Keywords: Ground temperature | Fuzzy | Support vector machine | Ground source heat pump
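The two validation schemes named in the abstract above, holdout and k-fold, can be illustrated with a short sketch. A plain least-squares line stands in for the ANN and LS-SVM models, which are not reproduced here; the depth and temperature values are synthetic and every function name is hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, xs, ys):
    """Root-mean-square error of a fitted line on the given points."""
    slope, intercept = model
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5


def holdout_rmse(xs, ys, test_fraction=0.3, seed=0):
    """Single random split: train on one part, score on the held-out part."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    model = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return rmse(model, [xs[i] for i in test], [ys[i] for i in test])


def kfold_rmse(xs, ys, k=5, seed=0):
    """Average held-out RMSE over k disjoint folds."""
    if not 2 <= k <= len(xs):
        raise ValueError("k must be between 2 and the number of samples")
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(scores) / k


# Synthetic data: temperature rising slowly with depth, plus small noise.
rng = random.Random(42)
depths = [float(d) for d in range(10, 101, 2)]
temps = [18.0 + 0.02 * d + rng.gauss(0.0, 0.05) for d in depths]
print("holdout RMSE:", holdout_rmse(depths, temps))
print("5-fold RMSE: ", kfold_rmse(depths, temps, k=5))
```

Holdout gives one estimate that depends on a single split, while k-fold uses every sample for testing exactly once, at the cost of fitting the model k times.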
Predicting and explaining corruption across countries: A machine learning approach
Predicting and explaining corruption across countries: a machine learning approach - 2020
In the era of Big Data, Analytics, and Data Science, corruption is still ubiquitous and is perceived as one of the major challenges of modern societies. A large body of academic studies has attempted to identify and explain the potential causes and consequences of corruption, at varying levels of granularity, mostly through theoretical lenses by using correlations and regression-based statistical analyses. The present study approaches the phenomenon from the predictive analytics perspective by employing contemporary machine learning techniques to discover the most important corruption perception predictors based on enriched/enhanced nonlinear models with a high level of predictive accuracy. Specifically, within the multiclass classification modeling setting that is employed herein, the Random Forest (an ensemble-type machine learning algorithm) is found to be the most accurate prediction/classification model, followed by Support Vector Machines and Artificial Neural Networks. From the practical standpoint, the enhanced predictive power of machine learning algorithms coupled with a multi-source database revealed the most relevant corruption-related information, contributing to the related body of knowledge and generating actionable insights for administrators, scholars, citizens, and politicians. The variable importance results indicated that government integrity, property rights, judicial effectiveness, and education index are the most influential factors in defining the corruption level of significance.
Keywords: Corruption perception | Machine learning | Predictive modeling | Random forest | Society policies and regulations | Government integrity | Social development
Predicting academic performance of students from VLE big data using deep learning models
Predicting the academic performance of students from VLE big data using deep learning models - 2020
The abundance of accessible educational data, supported by technology-enhanced learning platforms, provides opportunities to mine the learning behavior of students, address their issues, optimize the educational environment, and enable data-driven decision making. Virtual learning environments complement the learning analytics paradigm by effectively providing datasets for analysing and reporting the learning process of students and its reflection and contribution in their respective performances. This study deploys a deep artificial neural network on a set of unique handcrafted features, extracted from virtual learning environment clickstream data, to predict at-risk students and provide measures for early intervention in such cases. The results show that the proposed model achieves a classification accuracy of 84%–93%. We show that a deep artificial neural network outperforms the baseline logistic regression and support vector machine models. While logistic regression achieves an accuracy of 79.82%–85.60%, the support vector machine achieves 79.95%–89.14%. In line with existing studies, our findings demonstrate that the inclusion of legacy data and assessment-related data impacts the model significantly. Students interested in accessing the content of previous lectures are observed to demonstrate better performance. The study intends to assist institutes in formulating a necessary framework for pedagogical support, facilitating the higher education decision-making process towards sustainable education.
Keywords: Learning analytics | Predicting success | Educational data | Machine learning | Deep learning | Virtual learning environments (VLE)
Use of support vector machines with a parallel local search algorithm for data classification and feature selection
Using support vector machines with a parallel local search algorithm for data classification and feature selection - 2020
Over the last decade, the number of studies on machine learning has significantly increased. One of the most widely researched areas of machine learning is data classification. Most big data systems require a large amount of information storage for analytic purposes; however, this involves some disadvantages, such as the costs of processing and collecting data. Thus, many researchers and practitioners are working on effectively reducing the number of features used in classification. This paper proposes a method which jointly optimizes both feature selection and classification. A survey of the relevant literature shows that the vast majority of studies focus on either feature selection or classification. In this study, the proposed parallel local search algorithm both selects features and finds a classifier with high rates of accuracy. Moreover, the proposed method is capable of finding solutions for problems that have extremely high numbers of features within a reasonable computation time.
Keywords: Support vector machines | Feature selection | Classification | Heuristic | Machine learning
A grid-quadtree model selection method for support vector machines
A grid-quadtree model selection method for support vector machines - 2020
In this paper, a new model selection approach for the Support Vector Machine (SVM) is proposed. It integrates the quadtree technique with the grid search and is called grid-quadtree (GQ). The developed method is the first in the literature to apply the quadtree to SVM parameter optimization. The SVM is a machine-learning technique for pattern recognition whose performance depends on how its parameters are set. Thus, the model selection problem for SVM is an important field of study and requires expert and intelligent systems to solve it. Real classification data sets involve a huge number of instances and features, and the larger the training data set, the higher the cost of a recognition system. The grid search (GS) is the most popular and the simplest method for selecting SVM parameters. However, it is time-consuming, which limits its application to large problems. With this in mind, the main idea of this research is to apply the quadtree technique to the GS to make it faster. This may lower the computational time needed to solve problems such as bio-identification, bank credit risk and cancer detection. Based on the asymptotic behaviors of the SVM, it was observed that the quadtree is able to avoid evaluating the full GS search space. As a consequence, the GQ analyzes fewer parameter combinations and solves the same problem much more efficiently. To assess the GQ performance, ten classification benchmark data sets were used. The results were compared with those of the traditional GS. The outcomes showed that the GQ is able to find parameters that are as good as the GS ones while executing 78.8124% to 85.8415% fewer operations. This research shows that adopting the quadtree significantly reduces the computational time of the original GS, making it much more efficient for high-dimensional and large data sets.
Keywords: Support vector machine | Parameter determination | Quadtree | Grid search
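To make the idea of skipping most of the grid concrete, the sketch below compares an exhaustive grid search with a greedy quadrant refinement over two SVM-style parameters expressed as log2(C) and log2(gamma). This is not the GQ algorithm from the paper, which relies on the asymptotic behaviors of the SVM to decide where to subdivide; it is a simplified stand-in. The `toy_score` function is a hypothetical replacement for cross-validated accuracy, and the parameter ranges are assumptions.

```python
def toy_score(log2_c, log2_gamma):
    """Hypothetical smooth score with a single peak at (3, -5)."""
    return 1.0 / (1.0 + 0.05 * (log2_c - 3.0) ** 2 + 0.05 * (log2_gamma + 5.0) ** 2)


def full_grid_search(score, c_range, g_range, step=1.0):
    """Evaluate every grid point; return (best_score, c, g) and the eval count."""
    best, evals = None, 0
    c = c_range[0]
    while c <= c_range[1] + 1e-9:
        g = g_range[0]
        while g <= g_range[1] + 1e-9:
            s = score(c, g)
            evals += 1
            if best is None or s > best[0]:
                best = (s, c, g)
            g += step
        c += step
    return best, evals


def quadrant_search(score, c_range, g_range, min_size=1.0):
    """Split the box into four quadrants, score each centre, and recurse
    into the best quadrant until the box is no wider than min_size."""
    (c_lo, c_hi), (g_lo, g_hi) = c_range, g_range
    best, evals = None, 0
    while (c_hi - c_lo) > min_size or (g_hi - g_lo) > min_size:
        c_mid, g_mid = (c_lo + c_hi) / 2.0, (g_lo + g_hi) / 2.0
        quadrants = [
            ((c_lo, c_mid), (g_lo, g_mid)),
            ((c_lo, c_mid), (g_mid, g_hi)),
            ((c_mid, c_hi), (g_lo, g_mid)),
            ((c_mid, c_hi), (g_mid, g_hi)),
        ]
        scored = []
        for (cl, ch), (gl, gh) in quadrants:
            cc, gc = (cl + ch) / 2.0, (gl + gh) / 2.0
            scored.append((score(cc, gc), cc, gc, (cl, ch), (gl, gh)))
            evals += 1
        top = max(scored, key=lambda t: t[0])
        if best is None or top[0] > best[0]:
            best = top[:3]
        (c_lo, c_hi), (g_lo, g_hi) = top[3], top[4]
    return best, evals


C_RANGE, G_RANGE = (-5.0, 15.0), (-15.0, 3.0)  # assumed log2 ranges
grid_best, grid_evals = full_grid_search(toy_score, C_RANGE, G_RANGE)
quad_best, quad_evals = quadrant_search(toy_score, C_RANGE, G_RANGE)
print("grid search:    ", grid_best, "evaluations:", grid_evals)
print("quadrant search:", quad_best, "evaluations:", quad_evals)
```

A greedy refinement like this can miss the optimum when the score surface has several peaks, which is one reason a real method needs a principled rule for which regions to subdivide.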
An empirical case study on Indian consumers sentiment towards electric vehicles: A big data analytics approach
An empirical case study of Indian consumers' sentiment towards electric vehicles: a big data analytics approach - 2020
Today, climate change due to global warming is a significant concern to all of us. India's greenhouse gas emissions are increasing day by day, placing India among the top ten emitters in the world. Air pollution is one of the significant contributors to the greenhouse effect. Transportation contributes about 10% of the air pollution in India. The Indian government is taking steps to reduce air pollution by encouraging the use of electric vehicles. However, success depends on consumers' sentiment, perception and understanding towards Electric Vehicles (EV). This case study tried to capture the feelings, attitudes, and emotions of Indian consumers towards electric vehicles. The main objective of this study was to extract opinions valuable to prospective buyers (to know what is best for them), marketers (for determining what features should be advertised) and manufacturers (for deciding what features should be improved) using Deep Learning techniques (e.g., the Doc2Vec algorithm, Recurrent Neural Network (RNN), Convolutional Neural Network (CNN)). Due to the very nature of social media data, a big data platform was chosen to analyze the sentiment towards EV. Deep Learning based techniques were preferred over traditional machine learning algorithms (Support Vector Machine, Logistic regression, Decision tree, etc.) due to their superior text mining capabilities. Two years of data (2016 to 2018) were collected from different social media platforms for this case study. The results showed the efficiency of deep learning algorithms and found that CNN yields better results than the others. The proposed optimal model will help consumers, designers and manufacturers in their decision making when choosing, designing and manufacturing EV.
Keywords: Electric vehicles | Deep learning | Big data | Sentiment analysis | India
A novel spatio-temporal wind power forecasting framework based on multi-output support vector machine and optimization strategy
A novel spatio-temporal wind power forecasting framework based on a multi-output support vector machine and an optimization strategy - 2020
The integration of a large number of wind farms poses big challenges to the secure and economical operation of power systems, and ultra-short-term wind power forecasting is an effective solution. However, traditional approaches can only predict the power of an individual wind farm at a time and ignore the spatio-temporal correlation of wind farms. In this paper, a novel ultra-short-term forecasting framework based on spatio-temporal (ST) analysis, a multi-output support vector machine (MSVM) and the grey wolf optimizer (GWO), referred to as the ST-GWO-MSVM model, is proposed to predict the output wind power of multiple wind farms. The ST-GWO-MSVM model includes a data analysis stage, a parameter optimization stage, and a modeling stage. In the data analysis stage, the Pearson correlation coefficient and the partial autocorrelation function are used to analyze the spatio-temporal correlation of wind power. In the parameter optimization stage, to avoid the unreliable forecasting results that arise when parameters are chosen empirically, the GWO algorithm is used to optimize the kernel function parameters of the MSVM model. In the modeling stage, an innovative forecasting model using the optimal MSVM parameters is proposed to predict the output wind power of 15 wind farms. Results show that the performance of ST-GWO-MSVM is better than that of other benchmark models in terms of multiple error metrics, including fractional bias, direction accuracy, and improvement percentages.
Keywords: Wind power forecasting | Spatio-temporal correlation | Multi-output support vector machine | Grey wolf optimizer | Combined forecasting approaches
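The spatial part of the data analysis stage described above rests on the Pearson correlation coefficient between the power series of different wind farms. A minimal sketch of that computation follows; the farm names and power values are invented for illustration, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(series_by_name):
    """Pairwise Pearson correlations between all named series."""
    names = list(series_by_name)
    return {
        (a, b): pearson(series_by_name[a], series_by_name[b])
        for a in names
        for b in names
    }


# Invented power output (MW) at the same time steps for three farms.
farms = {
    "farm_a": [12.0, 15.5, 20.1, 18.3, 9.7, 11.2],
    "farm_b": [11.1, 14.9, 19.0, 17.8, 10.2, 10.9],
    "farm_c": [30.2, 28.7, 12.4, 15.0, 29.9, 31.3],
}
for (a, b), r in correlation_matrix(farms).items():
    if a < b:
        print(a, b, round(r, 3))
```

Farms whose series are strongly correlated are natural candidates for joint, multi-output prediction rather than one model per farm.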
Identifying non-O157 Shiga toxin-producing Escherichia coli (STEC) using deep learning methods with hyperspectral microscope images
Identifying non-O157 Shiga toxin-producing Escherichia coli using deep learning methods with hyperspectral microscope images - 2020
Non-O157 Shiga toxin-producing Escherichia coli (STEC) serogroups such as O26, O45, O103, O111, O121 and O145 often cause illness in people in the United States, and the conventional identification of these “Big-Six” serogroups is complex. The label-free hyperspectral microscope imaging (HMI) method, which provides spectral “fingerprint” information on bacterial cells, was employed to classify serogroups at the cellular level. In the spectral analysis, the principal component analysis (PCA) method and the stacked auto-encoder (SAE) method were used to extract principal spectral features for the classification task. Based on these features, multiple classifiers including linear discriminant analysis (LDA), support vector machine (SVM) and soft-max regression (SR) methods were evaluated. Datasets of different sizes were also tested in search of suitable classification models. SAE-based classification models performed better than PCA-based models, with classification accuracies of 93.5% for SAE-LDA, 94.9% for SAE-SVM and 94.6% for SAE-SR. In contrast, the PCA-based methods PCA-LDA, PCA-SVM and PCA-SR reached only 75.5%, 85.7% and 77.1%, respectively. The results also suggested that increasing the number of training samples has a positive effect on the classification models. With the larger dataset, the SAE-SR classification model finally performed better than the others, with an average accuracy of 94.9% in classifying STEC serogroups. Specifically, the O103 serogroup was classified with the highest accuracy of 97.4%, followed by O111 (96.5%), O26 (95.3%), O121 (95%), O145 (92.9%) and O45 (92.4%). Thus, the HMI technology coupled with the SAE-SR classification model has potential for “Big-Six” identification.
Keywords: Foodborne bacteria | Classification | Food safety | Machine learning | Stacked auto-encoder | Optical method
Oil palm mapping over Peninsular Malaysia using Google Earth Engine and machine learning algorithms
Oil palm mapping over Peninsular Malaysia using Google Earth Engine and machine learning algorithms - 2020
Oil palm plays a pivotal role in the ecosystem, environment and economy. Without proper monitoring, uncontrolled oil palm activities could contribute to deforestation, which can have severe negative impacts on the environment; proper management and monitoring of the oil palm industry are therefore necessary. Mapping the distribution of oil palm is crucial in order to manage and plan the sustainable operations of oil palm plantations. Remote sensing provides a means to detect and map oil palm from space effectively. Recent advances in cloud computing and big data allow rapid mapping to be performed over a large geographical scale. In this study, 30 m Landsat 8 data were processed using the Google Earth Engine (GEE) cloud computing platform in order to classify oil palm land cover using non-parametric machine learning algorithms such as Support Vector Machine (SVM), Classification and Regression Tree (CART) and Random Forest (RF) for the first time over Peninsular Malaysia. The hyperparameters were tuned, and the overall accuracies produced by the SVM, CART and RF were 93.16%, 80.08% and 86.50%, respectively. Overall, the SVM classified the 7 classes (water, built-up, bare soil, forest, oil palm, other vegetation and paddy) the best. However, RF extracted oil palm information better than the SVM. The algorithms were compared, and McNemar’s test showed significant values for the comparisons between SVM and CART and between RF and CART. On the other hand, the SVM and RF are considered equally effective. Despite the challenges in implementing machine learning optimisation using GEE over a large area, this paper shows the efficiency of GEE as a free cloud-based platform for mapping bioresource distributions such as oil palm over a large area in Peninsular Malaysia.
Keywords: cloud computing | image classification | Landsat | machine learning | oil palm
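The abstract above compares classifiers with McNemar's test but does not state which variant was used. The following sketch assumes the chi-square form with continuity correction, applied to the samples on which two classifiers disagree. The labels and discordant counts are made up and are not the paper's results; the function names are hypothetical.

```python
import math


def discordant_counts(truth, pred_a, pred_b):
    """Count samples where exactly one of two classifiers is correct.

    Returns (b, c): b = only classifier A correct, c = only classifier B correct.
    """
    b = sum(1 for t, a, p in zip(truth, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(truth, pred_a, pred_b) if a != t and p == t)
    return b, c


def mcnemar(b, c):
    """McNemar chi-square statistic (continuity-corrected) and p-value.

    With one degree of freedom, the chi-square survival function is
    erfc(sqrt(statistic / 2)).
    """
    if b + c == 0:
        raise ValueError("no discordant samples: the test is undefined")
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up labels for a handful of validation pixels.
truth = ["palm", "forest", "water", "palm", "paddy", "forest"]
pred_a = ["palm", "forest", "water", "palm", "forest", "forest"]
pred_b = ["palm", "palm", "water", "forest", "forest", "palm"]
print("discordant counts:", discordant_counts(truth, pred_a, pred_b))

# Made-up discordant counts from a larger hypothetical validation set.
stat, p = mcnemar(25, 5)
print("statistic:", round(stat, 3), "p-value:", p)
```

A small p-value indicates that the two classifiers make different kinds of errors more often than chance would explain, while a large one is consistent with the two performing equally well.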