Pivot-based approximate k-NN similarity joins for big high-dimensional data (2020)
Given an appropriate similarity model, the k-nearest neighbor similarity join represents a useful yet costly operator for data mining, data analysis and data exploration applications. The time to evaluate the operator depends on the size of the datasets, the data distribution and the dimensionality of the data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the joins practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems, Apache Hadoop and Spark. Focusing on the metric space approach relying on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and also presents the empirical performance evaluation, approximation precision, and scalability properties of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. Key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision and execution time.
Keywords: Hadoop | Spark | MapReduce | k-NN | Approximate similarity join | High-dimensional data
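The pivot-partitioning idea at the heart of these methods can be sketched on a single machine. Below is a minimal, illustrative sketch (not the paper's distributed MapReduce implementation): each object is assigned to its nearest randomly chosen pivot, and k-NN lists are then computed only within each partition, trading accuracy for speed. All names and defaults are hypothetical.

```python
import math
import random

def dist(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def approx_knn_join(data, k, num_pivots, seed=0):
    """Approximate k-NN self-join: partition objects by their nearest pivot,
    then search for neighbors only inside each partition."""
    rng = random.Random(seed)
    pivots = rng.sample(data, num_pivots)  # randomly initialized pivots
    parts = {}
    for p in data:
        i = min(range(num_pivots), key=lambda j: dist(p, pivots[j]))
        parts.setdefault(i, []).append(p)
    result = {}
    for group in parts.values():
        for q in group:
            others = [o for o in group if o is not q]
            others.sort(key=lambda o: dist(q, o))
            result[tuple(q)] = [tuple(o) for o in others[:k]]
    return result
```

With one pivot the join degenerates to the exact k-NN self-join; more pivots shrink the partitions and speed up the search at the cost of missed cross-partition neighbors.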
Stochastic parallel extreme artificial hydrocarbon networks: An implementation for fast and robust supervised machine learning in high-dimensional data (2020)
Artificial hydrocarbon networks (AHN) – a supervised learning method inspired by organic chemical structures and mechanisms – have shown improvements in predictive power and interpretability in comparison with other well-known machine learning models. However, AHN training is very time-consuming and, until now, has not been able to deal with large data. In this paper, we introduce stochastic parallel extreme artificial hydrocarbon networks (SPE-AHN), an algorithm for fast and robust training of supervised AHN models on high-dimensional data. This training method comprises a population-based meta-heuristic optimization with an individual encoding and objective function defined for the AHN model, a parallel-computing implementation, and a stochastic learning approach for consuming large data. We conducted three experiments with synthetic and real data sets to validate the training execution time and performance of the proposed algorithm. Experimental results demonstrated that the proposed SPE-AHN outperforms the original AHN method, increasing the speed of training more than 10,000 times in the worst-case scenario. Additionally, we present two case studies on real data sets: solar-panel deployment prediction (a regression problem), and classification of human falls and daily activities in healthcare monitoring systems (a classification problem). These case studies showed that SPE-AHN improves on state-of-the-art machine learning models in both engineering problems. We anticipate our new training algorithm to be useful in many applications of AHN, such as robotics, finance, medical engineering and aerospace, in which large amounts of data (e.g. big data) are essential.
Keywords: Machine learning | Parallel computing | Extreme learning machines | Stochastic learning | Regression | Classification | Big data
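The training scheme described above — a population-based meta-heuristic evaluated on random mini-batches — can be illustrated with a generic stand-in. This sketch is not the AHN objective, encoding, or parallel implementation from the paper; the optimizer, names and defaults are all hypothetical.

```python
import random

def stochastic_population_search(objective, dim, data, pop_size=20,
                                 iters=100, batch=32, sigma=0.3, seed=0):
    """Minimal population-based stochastic search: every generation scores
    candidates on a random mini-batch of the data (hypothetical stand-in
    for a meta-heuristic training loop like SPE-AHN's)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    best = None
    for _ in range(iters):
        mb = rng.sample(data, min(batch, len(data)))  # stochastic learning step
        scored = sorted(pop, key=lambda ind: objective(ind, mb))
        # track the incumbent using the full data set
        if best is None or objective(scored[0], data) < objective(best, data):
            best = scored[0][:]
        elite = scored[: pop_size // 2]
        # refill the population by mutating the elite half
        pop = elite + [[g + rng.gauss(0, sigma) for g in ind] for ind in elite]
    return best
```

A parallel implementation would evaluate the population concurrently; here the loop is kept serial for clarity.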
Control-Based Algorithms for High Dimensional Online Learning (2020)
In the era of big data, high-dimensional online learning problems require huge computing power. This paper proposes a novel approach for high-dimensional online learning. Two new algorithms are developed for online high-dimensional regression and classification problems, respectively. The problems are formulated as feedback control problems for some low-dimensional systems, and the novel learning algorithms are then developed via these control problems. Via an efficient polar decomposition, we derive the explicit solutions of the control problems, substantially reducing the corresponding computational complexity, especially for high-dimensional, large-scale data streams. Compared with conventional methods, the new algorithms can achieve more robust and accurate performance with faster convergence. This paper demonstrates that optimal control can be an effective approach for developing high-dimensional learning algorithms. We have also, for the first time, proposed a control-based robust algorithm for classification problems. Numerical results support our theory and illustrate the efficiency of our algorithms.
Keywords: Classification | high dimensional dataset | model predictive control | online learning | robust control
High-dimensional image descriptor matching using highly parallel KD-tree construction and approximate nearest neighbor search (2019)
To overcome the high computational cost associated with high-dimensional digital image descriptor matching, this paper presents a set of integrated parallel algorithms for the construction of K-dimensional trees (KD-trees) and P approximate nearest neighbor search (P-ANNS) on modern massively parallel architectures (MPA). To improve the runtime performance of the P-ANNS, we propose an efficient sliding window for a parallel buffered P-ANNS on the KD-tree to mitigate the high cost of global memory accesses. When applied to high-dimensional real-world image descriptor datasets, the proposed KD-tree construction and buffered P-ANNS algorithms match the quality of their traditional sequential CPU counterparts, while outperforming them by speedup factors of up to 17 and 163, respectively. The algorithms are also studied for performance impact factors to obtain the optimal runtime configurations for various datasets. Moreover, we verify the features of the parallel algorithms on typical 3D image matching scenarios. With the classical local image descriptor signature of histograms of orientations (SHOT) datasets, the parallel KD-tree construction and image descriptor matching can achieve up to 11- and 138-fold speedups, respectively.
Keywords: KD-tree | Approximate nearest neighbor search | Parallel algorithm | GPU | CUDA | Image matching | Pattern recognition
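For context, the serial CPU baseline that the speedups above are measured against is the classic KD-tree. A minimal single-threaded sketch of KD-tree construction and exact nearest-neighbor search follows (the paper's massively parallel GPU variants and the buffered approximate search are not reproduced here):

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a KD-tree; the splitting axis cycles with depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nn_search(node, query, best=None):
    """Exact nearest-neighbor search with branch pruning."""
    if node is None:
        return best
    d = math.dist(query, node["point"])
    if best is None or d < best[0]:
        best = (d, node["point"])
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = ((node["left"], node["right"]) if diff <= 0
                 else (node["right"], node["left"]))
    best = nn_search(near, query, best)
    if abs(diff) < best[0]:  # descend the far side only if it may hold a closer point
        best = nn_search(far, query, best)
    return best
```

Approximate variants typically cap how much of the tree the search may visit instead of pruning exactly, which is where the speed/quality trade-off comes from.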
Default prediction in P2P lending from high-dimensional data based on machine learning (2019)
In recent years, a new Internet-based unsecured credit model, peer-to-peer (P2P) lending, has flourished and become a successful complement to the traditional credit business. However, credit risk remains inevitable. A key challenge is creating a default prediction model that can effectively and accurately predict the default probability of each loan for a P2P lending platform. Due to the characteristics of P2P lending credit data, such as high dimensionality and class imbalance, conventional statistical models and machine learning algorithms cannot effectively and accurately predict default probability. To address this issue, a decision tree model-based heterogeneous ensemble default prediction model is proposed in this paper for accurate prediction of customer default in P2P lending. Gradient boosting decision trees (GBDT), extreme gradient boosting (XGBoost) and light gradient boosting machine (LightGBM) are employed as individual classifiers to create a heterogeneous ensemble learning-based default prediction model. Learning model-based feature ranking is applied to P2P lending credit data, and individual classifiers undergo hyperparameter optimization. Finally, comparison with benchmark models shows that the prediction model can achieve desirable prediction results and thus effectively solve the challenge of predictions based on high-dimensional and imbalanced data.
Keywords: Default prediction | High-dimensional data | Imbalanced data | Machine learning | P2P lending
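The heterogeneous-ensemble step can be pictured as a soft-voting combiner that averages the default probabilities produced by the base classifiers (GBDT, XGBoost, LightGBM). The weighted average shown here is an assumed combination rule for illustration, not necessarily the paper's exact scheme:

```python
def soft_vote(prob_lists, weights=None):
    """Average per-loan default probabilities from several base classifiers;
    equal weights are assumed unless explicit weights are given."""
    n = len(prob_lists)
    weights = weights or [1.0 / n] * n
    return [sum(w * probs[i] for w, probs in zip(weights, prob_lists))
            for i in range(len(prob_lists[0]))]
```

In practice each inner list would come from one tuned classifier's `predict_proba`-style output over the same loans.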
A deep learning solution approach for high-dimensional random differential equations (2019)
Developing efficient numerical algorithms for the solution of high dimensional random Partial Differential Equations (PDEs) has been a challenging task due to the well-known curse of dimensionality. We present a new solution approach for these problems based on deep learning. This approach is intrusive, entirely unsupervised, and mesh-free. Specifically, the random PDE is approximated by a feed-forward fully-connected deep residual network, with either strong or weak enforcement of initial and boundary constraints. Parameters of the approximating deep neural network are determined iteratively using variants of the Stochastic Gradient Descent (SGD) algorithm. The satisfactory accuracy of the proposed approach is numerically demonstrated on diffusion and heat conduction problems, in comparison with the converged Monte Carlo-based finite element results.
Keywords: Deep learning | Deep neural networks | Residual networks | Random differential equations | Curse of dimensionality | Least squares
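A typical residual-minimization loss for this kind of unsupervised PDE solver (an illustrative formulation, not necessarily the paper's exact objective) penalizes the PDE residual at interior collocation points and, when constraints are enforced weakly, adds a boundary/initial penalty:

```latex
\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N}
\bigl|\mathcal{N}[u_\theta](x_i,\omega_i)\bigr|^2
\;+\;\lambda\,\frac{1}{M}\sum_{j=1}^{M}
\bigl|\mathcal{B}[u_\theta](y_j,\omega_j)\bigr|^2
```

Here $u_\theta$ is the deep residual network, $\mathcal{N}$ the random PDE operator, $\mathcal{B}$ the initial/boundary operator, $\omega$ a random realization, and $\lambda$ a penalty weight (absent under strong enforcement); $\theta$ is updated with SGD variants.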
A review of feature selection methods in medical applications (2019)
Feature selection is a preprocessing technique that identifies the key features of a given problem. It has traditionally been applied in a wide range of problems that include biological data processing, finance, and intrusion detection systems. In particular, feature selection has been successfully used in medical applications, where it can not only reduce dimensionality but also help us understand the causes of a disease. We describe some basic concepts related to medical applications and provide some necessary background information on feature selection. We review the most recent feature selection methods developed for and applied in medical problems, covering prolific research fields such as medical imaging, biomedical signal processing, and DNA microarray data analysis. A case study of two medical applications that includes actual patient data is used to demonstrate the suitability of applying feature selection methods in medical problems and to illustrate how these methods work in real-world scenarios.
Keywords: Feature selection | High dimensionality | Pattern recognition | Medical imaging | Biomedical data
Hybrid fast unsupervised feature selection for high-dimensional data (2019)
The emergence of the "curse of dimensionality" issue in high-dimensional datasets deteriorates the capability of learning algorithms, and also requires high memory and computational costs. Selecting features by discarding redundant and irrelevant ones is a crucial machine learning technique aimed at reducing the dimensionality of these datasets, which improves the performance of the learning algorithm. Feature selection has been extensively applied in many application areas relevant to expert and intelligent systems, such as data mining and machine learning. Although many algorithms have been developed so far, they remain unsatisfactory when confronting high-dimensional data. This paper presents a new hybrid filter-based feature selection algorithm based on a combination of clustering and the modified Binary Ant System (BAS), called FSCBAS, to overcome the search space and high-dimensional data processing challenges efficiently. This model provides both global and local search capabilities between and within clusters. In the proposed method, inspired by genetic algorithms and simulated annealing, a damped mutation strategy is introduced that avoids falling into local optima, and a new redundancy reduction policy adopted to estimate the correlation between the selected features further improves the algorithm. The proposed method can be applied in many expert system applications, such as microarray data processing, text classification and image processing, to handle the high dimensionality of the feature space and improve classification performance simultaneously. The performance of the proposed algorithm was compared to that of state-of-the-art feature selection algorithms using different classifiers on real-world datasets. The experimental results confirmed that the proposed method reduces computational complexity significantly and achieves better performance than the other feature selection methods.
Keywords: Feature selection | High-dimensional data | Binary ant system | Clustering | Mutation
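The filter-style idea of combining relevance ranking with redundancy reduction between correlated features can be sketched greedily. This is an illustrative simplification only — FSCBAS itself searches with a modified Binary Ant System and damped mutation, which is not reproduced here — and the function names and threshold are hypothetical:

```python
import math

def pearson(x, y):
    # Pearson correlation of two equal-length numeric sequences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy) if vx and vy else 0.0

def cluster_and_select(features, target, threshold=0.8):
    """Greedy redundancy reduction: walk features in order of relevance
    to the target and keep one representative per correlated cluster."""
    order = sorted(range(len(features)),
                   key=lambda i: abs(pearson(features[i], target)),
                   reverse=True)
    selected = []
    for i in order:
        if all(abs(pearson(features[i], features[j])) < threshold
               for j in selected):
            selected.append(i)
    return selected
```

Features whose correlation with an already-selected feature exceeds the threshold are treated as redundant and dropped.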
PUMA: Parallel subspace clustering of categorical data using multi-attribute weights (2019)
There are two main reasons why traditional clustering schemes are incompetent for high-dimensional categorical data. First, traditional methods usually represent each cluster by all dimensions without difference; second, traditional clustering methods rely only on an individual dimension of projection as an attribute's weight, ignoring relevance among attributes. We solve these two problems with a MapReduce-based subspace clustering algorithm (called PUMA) using multi-attribute weights. The attribute subspaces are constructed in PUMA by calculating an attribute-value weight based on the co-occurrence probability of attribute values among different dimensions. PUMA obtains sub-clusters corresponding to respective attribute subspaces from each computing node in parallel. Lastly, PUMA measures clusters at various scales by applying a hierarchical clustering method to iteratively merge sub-clusters. We implement PUMA on a 24-node Hadoop cluster. Experimental results reveal that using multi-attribute weights with subspace clustering can achieve better clustering accuracy on both synthetic and real-world high-dimensional datasets. Experimental results also show that PUMA achieves high performance in terms of extensibility, scalability and nearly linear speedup with respect to the number of nodes. Additionally, experimental results demonstrate that PUMA is reasonable, effective, and practical for expert systems such as knowledge acquisition, word sense disambiguation, automatic abstracting and recommender systems.
Keywords: Parallel subspace clustering | Multi-attribute weights | High dimension | Categorical data | MapReduce
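A single-machine illustration of weighting attribute values by their co-occurrence across dimensions. The abstract does not give PUMA's exact formula, so this sketch uses an assumed stand-in: the concentration (sum of squared conditional probabilities) of a value's co-occurrences, averaged over the other attributes — higher when a value reliably predicts the values of the other attributes:

```python
from collections import Counter
from itertools import combinations

def attr_value_weights(rows):
    """Weight each (attribute index, value) pair by how concentrated its
    co-occurrences with the other attributes' values are (illustrative
    stand-in, not PUMA's published formula)."""
    m = len(rows[0])
    pair_counts = Counter()  # counts of ((i, v), (j, w)) co-occurrences
    val_counts = Counter()   # counts of (i, v)
    for row in rows:
        cells = list(enumerate(row))
        for c in cells:
            val_counts[c] += 1
        for a, b in combinations(cells, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    weights = {}
    for (i, v), cnt in val_counts.items():
        total = 0.0
        for j in range(m):
            if j == i:
                continue
            # sum of squared conditional co-occurrence probabilities P(w | v)
            total += sum((pair_counts[((i, v), (j, w))] / cnt) ** 2
                         for (jj, w) in val_counts if jj == j)
        weights[(i, v)] = total / (m - 1)
    return weights
```

A value that always appears with the same companion values gets weight 1.0; values whose companions are spread uniformly get lower weights.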
Feature grouping-based parallel outlier mining of categorical data using Spark (2019)
This paper proposes a feature-grouping-based parallel outlier mining method called POS for high-dimensional categorical datasets. Existing outlier mining methods are inadequate for datasets that are voluminous and complex. We solve this problem by proposing a parallel framework built on the Spark platform for categorical, massive data. POS is composed of two modules: parallel feature grouping and parallel outlier mining. Additionally, a vertical transformation is utilized to improve the performance of POS. We implement POS on the Spark platform and evaluate it using synthetic and real-world datasets. Our experimental results confirm that POS is a promising and practical parallel algorithm for mining outliers in high-dimensional categorical datasets, achieving high performance in terms of extensibility and scalability.
Keywords: Parallel outlier mining | High-dimensional categorical data | Feature grouping | Feature relation | Spark
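The overall flow — group features, then score each row by how rare its categorical values are within each group — can be sketched on a single machine. The frequency-based score below is an assumed stand-in for illustration, not the paper's exact measure, and the Spark parallelism and vertical transformation are omitted:

```python
from collections import Counter

def outlier_scores(rows, groups):
    """Score each row by the rarity of its attribute values, summed
    per feature group (simplified, single-machine stand-in for POS)."""
    n = len(rows)
    # per-column frequency tables over the whole dataset
    freq = [Counter(row[j] for row in rows) for j in range(len(rows[0]))]
    scores = []
    for row in rows:
        s = 0.0
        for group in groups:
            # rarer values inside a group contribute more to the score
            s += sum(1.0 - freq[j][row[j]] / n for j in group)
        scores.append(s)
    return scores
```

Rows with the highest scores are reported as outlier candidates; in the paper each group is scored in parallel across the cluster.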