Download and view articles related to GPUs :: Page 1
Download the best ISI articles with Persian translation


Search results - GPUs

Number of articles found: 34
No. | Title | Type
1 Benchmarking vision kernels and neural network inference accelerators on embedded platforms
Benchmarking vision kernels and neural network inference accelerators on embedded platforms - 2021
Developing efficient embedded vision applications requires exploring various algorithmic optimization trade-offs and a broad spectrum of hardware architecture choices. This makes navigating the solution space and finding the design points with optimal performance trade-offs a challenge for developers. To help provide a fair baseline comparison, we conducted comprehensive benchmarks of accuracy, run-time, and energy efficiency of a wide range of vision kernels and neural networks on multiple embedded platforms: ARM57 CPU, Nvidia Jetson TX2 GPU and Xilinx ZCU102 FPGA. Each platform utilizes its optimized libraries for vision kernels (OpenCV, VisionWorks and xfOpenCV) and neural networks (OpenCV DNN, TensorRT and Xilinx DPU). For vision kernels, our results show that the GPU achieves an energy/frame reduction ratio of 1.1–3.2× compared to the others for simple kernels. However, for more complicated kernels and complete vision pipelines, the FPGA outperforms the others with energy/frame reduction ratios of 1.2–22.3×. For neural networks (Inception-v2, ResNet-50, ResNet-18, MobileNet-v2 and SqueezeNet), the results show that the FPGA achieves speedups of 2.5, 2.1, 2.6, 2.9 and 2.5× and EDP reduction ratios of 1.5, 1.1, 1.4, 2.4 and 1.7× compared to the GPU FP16 implementations, respectively.
Keywords: Benchmarks | CPUs | GPUs | FPGAs | Embedded vision | Neural networks
English article
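As a brief illustration of the metrics compared in the entry above, the following Python sketch computes energy per frame and the energy-delay product (EDP) from an average power reading and a measured runtime. The function names and numbers are illustrative assumptions, not values or code from the paper.

# Illustrative calculation of the energy/frame and energy-delay-product (EDP)
# metrics used when comparing accelerators. Placeholder numbers only.

def energy_per_frame(avg_power_w: float, runtime_s: float, n_frames: int) -> float:
    """Energy consumed per processed frame, in joules."""
    return (avg_power_w * runtime_s) / n_frames

def edp(avg_power_w: float, runtime_s: float) -> float:
    """Energy-delay product: total energy (J) multiplied by runtime (s)."""
    return (avg_power_w * runtime_s) * runtime_s

if __name__ == "__main__":
    # Hypothetical GPU vs. FPGA runs of the same 1000-frame pipeline.
    gpu_epf = energy_per_frame(avg_power_w=9.0, runtime_s=12.0, n_frames=1000)
    fpga_epf = energy_per_frame(avg_power_w=5.0, runtime_s=8.0, n_frames=1000)
    print(f"energy/frame reduction ratio (GPU/FPGA): {gpu_epf / fpga_epf:.2f}x")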
2 Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training
Benchmarking the performance and energy efficiency of AI accelerators for AI training - 2020
Deep learning has become widely used in complex AI applications. Yet, training a deep neural network (DNN) model requires a considerable amount of computation, long running times, and much energy. Nowadays, many-core AI accelerators (e.g., GPUs and TPUs) are designed to improve the performance of AI training. However, processors from different vendors perform dissimilarly in terms of performance and energy consumption. To investigate the differences among several popular off-the-shelf processors (i.e., Intel CPU, NVIDIA GPU, AMD GPU, and Google TPU) in training DNNs, we carry out a comprehensive empirical study on the performance and energy efficiency of these processors by benchmarking a representative set of deep learning workloads, including computation-intensive operations, classical convolutional neural networks (CNNs), recurrent neural networks (LSTM), Deep Speech 2, and Transformer. Different from the existing end-to-end benchmarks, which only present the training time, we try to investigate the impact of the hardware, the vendor's software library, and the deep learning framework on the performance and energy consumption of AI training. Our evaluation methods and results not only provide an informative guide for end users to select proper AI accelerators, but also expose some opportunities for hardware vendors to improve their software libraries.
Index Terms: AI Accelerator | Deep Learning | CPU | GPU | TPU | Computation-intensive Operations | Convolutional Neural Networks | Recurrent Neural Networks | Transformer | Deep Speech 2
English article
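The study above reports both training throughput and energy. A minimal sketch of how such measurements are commonly gathered follows, assuming a hypothetical run_training_step() and a hypothetical read_power_watts() wrapper around a vendor power-monitoring interface (e.g., NVML on NVIDIA GPUs); neither function is from the paper.

import time

# Time a fixed number of training steps and integrate sampled power readings
# to estimate throughput (samples/s) and energy (J). Callables are stand-ins.

def benchmark(run_training_step, read_power_watts, steps=100, batch_size=64):
    energy_j = 0.0
    start = time.time()
    last = start
    for _ in range(steps):
        run_training_step()
        now = time.time()
        energy_j += read_power_watts() * (now - last)  # rectangle-rule integration
        last = now
    elapsed = last - start
    throughput = steps * batch_size / elapsed          # samples per second
    return throughput, energy_j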
3 Training memristor-based multilayer neuromorphic networks with SGD, momentum and adaptive learning rates
Training memristor-based multilayer neuromorphic networks with SGD, momentum and adaptive learning rates - 2020
Neural networks implemented with traditional hardware face an inherent limitation of memory latency. Specifically, the processing units, such as GPUs, FPGAs, and customized ASICs, must wait for inputs to be read from memory and for outputs to be written back. This motivates memristor-based neuromorphic computing, in which the memory units (i.e., memristors) themselves have computing capabilities. However, training a memristor-based neural network is difficult since memristors work differently from CMOS hardware. This paper proposes a new training approach that enables prevailing neural network training techniques to be applied to memristor-based neuromorphic networks. In particular, we introduce momentum and an adaptive learning rate to the circuit training, both of which are proven methods that significantly accelerate the convergence of neural network parameters. Furthermore, we show that this circuit can be used for neural networks with arbitrary numbers of layers, neurons, and parameters. Simulation results on four classification tasks demonstrate that the proposed circuit achieves both high accuracy and fast speed. Compared with the SGD-based training circuit, on the WBC data set, the training speed of our circuit is increased by 37.2% while the accuracy is reduced by only 0.77%. On the MNIST data set, the new circuit even leads to improved accuracy.
Keywords: Memristor | Neural network | Adaptive learning rate
English article
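For readers unfamiliar with the optimizer terminology in the abstract above, the sketch below shows the conventional software form of an SGD update with momentum and a per-parameter adaptive learning rate (RMSProp-style). It is a generic illustration under that assumption, not the paper's memristor-circuit implementation.

import numpy as np

# One parameter update combining momentum with an adaptive per-parameter step.
# Generic textbook form; the paper realizes an analogous rule in hardware.

def momentum_adaptive_step(w, grad, velocity, sq_avg,
                           lr=0.01, beta=0.9, rho=0.99, eps=1e-8):
    sq_avg = rho * sq_avg + (1 - rho) * grad**2       # running average of squared grads
    adaptive_lr = lr / (np.sqrt(sq_avg) + eps)        # per-parameter learning rate
    velocity = beta * velocity - adaptive_lr * grad   # momentum accumulation
    return w + velocity, velocity, sq_avg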
4 PowerCoord: Power capping coordination for multi-CPU/GPU servers using reinforcement learning
PowerCoord: Power capping coordination for multi-CPU/GPU servers using reinforcement learning - 2020
Modern supercomputers and cloud providers rely on server nodes that are equipped with multiple CPU sockets and general purpose GPUs (GPGPUs) to handle the high demand for intensive computations. These nodes consume much higher power than commodity servers, and integrating them with the power capping systems used in modern clusters presents new challenges. In this paper, we propose a new power capping controller, PowerCoord, that is specifically designed for servers with multiple CPU and GPU sockets running multiple jobs at a time. PowerCoord coordinates among the various power domains (e.g., CPU sockets and GPUs) inside a server node to meet target power caps, while seeking to maximize throughput. Our approach also takes into consideration job deadlines and priorities. Because performance modeling for co-located jobs is error-prone, PowerCoord uses a learning method. PowerCoord has a number of heuristic policies to allocate power among the various CPUs and GPUs, and it uses reinforcement learning for policy selection during runtime. Based on the observed state of the system, PowerCoord shifts the distribution of selected policies. We implement our power cap controller on a real multi-CPU/GPU server with low overhead, and we demonstrate that it is able to meet target power caps while maximizing throughput and balancing other demands such as priorities and deadlines. Our results show PowerCoord improves the server throughput on average by 18% compared with the case when power is not coordinated among CPU/GPU domains. Also, PowerCoord improves the server throughput on average by 11% compared with prior work that uses a heuristic approach to coordinate the power among domains.
Keywords : Power capping | GPGPU acceleration | Reinforcement learning
English article
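As a rough illustration of selecting among heuristic policies at runtime via reinforcement learning, the sketch below implements a simple epsilon-greedy selector over a set of hypothetical power-allocation policies. The policy names and reward handling are assumptions for illustration, not PowerCoord's actual design.

import random

# Epsilon-greedy selection among candidate power-allocation policies,
# updating a running reward estimate per policy after each interval.

POLICIES = ["favor_gpu", "favor_cpu", "proportional_share"]

class PolicySelector:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.value = {p: 0.0 for p in POLICIES}   # running reward estimates
        self.count = {p: 0 for p in POLICIES}

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(POLICIES)         # explore
        return max(POLICIES, key=self.value.get)   # exploit best policy so far

    def update(self, policy, reward):
        self.count[policy] += 1
        n = self.count[policy]
        self.value[policy] += (reward - self.value[policy]) / n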
5 Exploiting potential of deep neural networks by layer-wise fine-grained parallelism
Exploiting the potential of deep neural networks by layer-wise fine-grained parallelism - 2020
Deep neural networks (DNNs) have become more and more important for big data analysis. They usually use data parallelism or model parallelism for extreme-scale computing. However, the two approaches realize their performance improvements mainly through coarse-grained parallelization schemes; neither can fully exploit the potential parallelism of many-core systems (such as GPUs) for neural network models. Here, a new fine-grained parallelism strategy (named FiLayer) is presented based on layer-wise parallelization. It has two components: inter-layer parallelism and intra-layer parallelism. Inter-layer parallelism allows several neighboring layers of a network model to be processed in a pipelined manner. For intra-layer parallelism, the operations in one layer are separated into several parts and processed concurrently. CUDA streams are used to implement these fine-grained parallelism methods. A mathematical analysis is presented of the influence of the fragment number on the performance of inter-layer parallelism, and of the influence of the number of CUDA streams on the performance of intra-layer parallelism. The proposed approach is implemented on top of Caffe. Representative datasets, including CIFAR-100 and ImageNet, are used in the experiments. The evaluation results show that FiLayer helps Caffe achieve remarkable speedups, which is of considerable value for big data analysis.
Keywords: Deep learning | Fine-grained parallelism | CUDA stream
English article
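FiLayer itself is implemented in Caffe with CUDA streams. Purely as an illustration of the intra-layer idea, the PyTorch sketch below splits one layer's batch across several CUDA streams so the kernels can overlap on the GPU; the layer and tensor shapes are arbitrary examples, not the paper's configuration.

import torch

# Split a layer's input batch into parts and launch each part on its own
# CUDA stream; synchronize before gathering the results.

def intra_layer_parallel(layer, x, n_parts=2):
    chunks = x.chunk(n_parts, dim=0)
    streams = [torch.cuda.Stream() for _ in chunks]
    outputs = [None] * len(chunks)
    for i, (chunk, stream) in enumerate(zip(chunks, streams)):
        stream.wait_stream(torch.cuda.current_stream())  # input is ready first
        with torch.cuda.stream(stream):
            outputs[i] = layer(chunk)                     # parts run concurrently
    torch.cuda.synchronize()                              # wait for all streams
    return torch.cat(outputs, dim=0)

if __name__ == "__main__" and torch.cuda.is_available():
    conv = torch.nn.Conv2d(3, 16, 3).cuda()
    data = torch.randn(8, 3, 32, 32, device="cuda")
    out = intra_layer_parallel(conv, data)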
6 A Survey and Taxonomy of FPGA-based Deep Learning Accelerators
A survey and taxonomy of FPGA-based deep learning accelerators - 2019
Deep learning, the fastest growing segment of Artificial Neural Networks (ANNs), has led to the emergence of many machine learning applications and their implementation across multiple platforms such as CPUs, GPUs and reconfigurable hardware (Field-Programmable Gate Arrays, or FPGAs). However, inspired by the structure and function of ANNs, large-scale deep learning topologies require a considerable amount of parallel processing, memory resources, high throughput and significant processing power. Consequently, in the context of real-time hardware systems, it is crucial to find the right trade-off between performance, energy efficiency, fast development, and cost. Although FPGAs are limited in size and resources, several approaches have shown that they provide a good starting point for the development of future deep learning implementation architectures. In this paper, we briefly review recent work related to the implementation of deep learning algorithms on FPGAs. We analyze and compare the design requirements and features of existing topologies, and then propose development strategies and implementation architectures for better use of FPGA-based deep learning topologies. In this context, we examine the frameworks used in these studies, which allow many topologies to be tested in order to arrive at the best implementation alternatives in terms of performance and energy efficiency.
Keywords: Deep learning | Framework | Optimized implementation | FPGA
English article
7 A survey of techniques for optimizing deep learning on GPUs
A survey of techniques for optimizing deep learning on GPUs - 2019
The rise of deep learning (DL) has been fuelled by improvements in accelerators. Due to its unique features, the GPU remains the most widely used accelerator for DL applications. In this paper, we present a survey of architecture- and system-level techniques for optimizing DL applications on GPUs. We review techniques for both inference and training, and for both single-GPU and distributed systems with multiple GPUs. We bring out the similarities and differences of different works and highlight their key attributes. This survey will be useful for both novices and experts in the fields of machine learning, processor architecture and high-performance computing.
Keywords: Review | GPU | Hardware architecture for deep learning | Accelerator | Distributed training | Parameter server | Allreduce | Pruning | Tiling
English article
8 An efficient manifold regularized sparse non-negative matrix factorization model for large-scale recommender systems on GPUs
An efficient manifold regularized sparse non-negative matrix factorization model for large-scale recommender systems on GPUs - 2019
Non-negative Matrix Factorization (NMF) plays an important role in many data mining applications for low-rank representation and analysis. Due to the sparsity caused by missing information in many high-dimension scenes, e.g., social networks or recommender systems, NMF cannot mine a more accurate representation from the explicit information. Manifold learning can incorporate the intrinsic geometry of the data, which is combined with a neighborhood carrying implicit information. Thus, manifold-regularized NMF (MNMF) can realize a more compact representation for sparse data. However, MNMF suffers from (a) the forming of large-scale Laplacian matrices, (b) frequent large-scale matrix manipulation, and (c) the involved K-nearest-neighbor points, which result in an overwriting problem in parallelization. To address these issues, a single-thread-based MNMF model is proposed for two types of divergence, i.e., Euclidean distance and Kullback–Leibler (KL) divergence, which depends only on the involved feature-tuples' multiplication and summation and can avoid large-scale matrix manipulation. Furthermore, this model can remove the dependence among the feature vectors thanks to its inherent fine-grained parallelism. On that basis, a CUDA-parallelized MNMF (CUMNMF) is presented for GPU computing. The experimental results show that CUMNMF achieves a 20× speedup compared with MNMF, as well as lower time complexity and space requirements.
Keywords: Collaborative filtering recommender systems | Data mining | Euclidean distance and KL-divergence | GPU parallelization | Manifold regularization | Non-negative matrix factorization
English article
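For context, the sketch below shows plain multiplicative-update NMF under Euclidean distance (the Lee–Seung rules), which is the factorization that MNMF extends. The paper's MNMF/CUMNMF updates additionally carry manifold (graph) regularization terms and a CUDA parallelization that are not reproduced here.

import numpy as np

# Factorize X (m x n) into non-negative W (m x rank) and H (rank x n)
# with standard multiplicative updates minimizing ||X - WH||_F^2.

def nmf(X, rank, iters=200, eps=1e-9):
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W
    return W, H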
9 Deep Learning with Microfluidics for Biotechnology
Deep learning with microfluidics for biotechnology - 2019
Advances in high-throughput and multiplexed microfluidics have rewarded biotechnology researchers with vast amounts of data but not necessarily the ability to analyze complex data effectively. Over the past few years, deep artificial neural networks (ANNs) leveraging modern graphics processing units (GPUs) have enabled the rapid analysis of structured input data – sequences, images, videos – to predict complex outputs with unprecedented accuracy. While there have been early successes in flow cytometry, for example, the extensive potential of pairing microfluidics (to acquire data) and deep learning (to analyze data) to tackle biotechnology challenges remains largely untapped. Here we provide a roadmap to integrating deep learning and microfluidics in biotechnology laboratories that matches computational architectures to problem types, and provide an outlook on emerging opportunities.
English article
10 Troodon: A machine-learning based load-balancing application scheduler for CPU–GPU system
Troodon: A machine-learning based load-balancing application scheduler for CPU–GPU systems - 2019
Heterogeneous computing machines consisting of a CPU and one or more GPUs are increasingly being used today because of their higher performance-cost ratio and lower energy consumption. For programming such heterogeneous systems, OpenCL has become an industry standard due to its portability across various computing architectures. To exploit the computing capabilities of heterogeneous systems, application developers are porting their cluster and Cloud applications using OpenCL. With the increasing number of such applications, the use of shared accelerating computing devices (such as CPUs and GPUs) should be managed by an efficient load-balancing scheduling heuristic capable of reducing execution time and increasing throughput with high device utilization. Most OpenCL applications are suited to (execute faster on) a specific computing device (CPU or GPU), and the speedup obtained on the suitable device also varies with data size. Mapping applications to computing devices without considering device suitability and the obtainable speedup on the suitable device leads to sub-optimal execution time, lower throughput and load imbalance. Therefore, an application scheduler should consider both device suitability and speedup variation in its scheduling decisions, leading to a reduction in execution time and an increase in throughput. In this paper, we present a novel load-balancing scheduling heuristic named Troodon that uses a machine-learning based device-suitability model to classify OpenCL applications as either CPU suitable or GPU suitable. Moreover, a speedup predictor that predicts the amount of speedup that jobs will obtain when executed on the suitable device is also part of Troodon. Troodon incorporates the E-OSched scheduling mechanism to map jobs onto the CPU and GPUs in a load-balanced way. This results in reduced application execution time, increased system throughput, and improved device utilization. We evaluate the proposed scheduler using a large number of data-parallel applications and compare it with several other state-of-the-art scheduling heuristics. The experimental evaluation demonstrates that the proposed scheduler outperforms the existing heuristics and reduces application execution time by up to 38% with increased system throughput and device utilization.
Keywords: Heterogeneous system | Scheduling | Device suitability | Load-balancing | Machine learning
English article
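A much-simplified sketch of the two model-driven decisions described above (device-suitability classification and speedup prediction feeding a load-balanced assignment) follows. The job representation, threshold, and model callables are hypothetical placeholders for illustration, not Troodon's or E-OSched's actual logic.

# Assign each job to a CPU or GPU queue using a suitability classifier and a
# speedup predictor, falling back to the other device when load is skewed.

def schedule(jobs, classify_device, predict_speedup):
    queues = {"cpu": [], "gpu": []}
    load = {"cpu": 0.0, "gpu": 0.0}
    # Handle jobs that benefit most from their suitable device first.
    for job in sorted(jobs, key=lambda j: -predict_speedup(j)):
        device = classify_device(job)              # "cpu" or "gpu"
        other = "gpu" if device == "cpu" else "cpu"
        # Fall back to the other device if the suitable one is overloaded.
        target = device if load[device] <= load[other] * 1.5 else other
        queues[target].append(job)
        load[target] += job.get("est_runtime", 1.0)
    return queues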