ViPS: A novel visual processing system architecture for medical imaging (2017)
Imaging has become an essential tool in modern medical science, and numerous powerful platforms for registering, storing, analyzing and processing medical imaging applications have appeared in recent years. In this article, we design an advanced visual processing system (ViPS) that stores and processes complex, multi-dimensional medical imaging applications. ViPS provides a user-friendly programming environment and a high-performance architecture for data acquisition, registration, storage and analysis, and performs segmentation, filtering, and recognition of complex, multi-dimensional medical images or videos in real time. The proposed architecture is highly reliable in terms of cost, performance, and power. ViPS is designed and evaluated on a Xilinx Virtex-7 FPGA VC707 Evaluation Kit, and its performance is compared with the heterogeneous multi-processing Odroid XU3 board and the GPU-based Jetson TK1 Embedded Development Kit with 192 CUDA cores. Compared with the heterogeneous multi-core and GPU-based graphics systems, the results show that ViPS improves system performance 2.4 and 1.4 times, respectively, for an iridology application. While executing real-time complex image reconstruction at 2x and 1.25x higher frame rates, ViPS achieves 15.2x and 5.26x performance improvements when running various image processing algorithms. ViPS achieves 3.01x and 1.13x speedups for video processing algorithms and draws 1.55 and 2.27 times less dynamic power.
Keywords: Embedded computer vision | High performance image processing systems
Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster (2017)
Deep learning algorithms base their success on building high learning capacity models with millions of parameters that are tuned in a data-driven fashion. These models are trained by processing millions of examples, so that the development of more accurate algorithms is usually limited by the throughput of the computing devices on which they are trained. In this work, we explore how the training of a state-of-the-art neural network for computer vision can be parallelized on a distributed GPU cluster. The effect of distributing the training process is addressed from two different points of view. First, the scalability of the task and its performance in the distributed setting are analyzed. Second, the impact of distributed training methods on the final accuracy of the models is studied.
Keywords: distributed computing | parallel systems | deep learning | Convolutional Neural Networks
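The abstract does not name an implementation, but the standard scheme such studies analyze is synchronous data-parallel training: each GPU computes gradients on its own shard of a mini-batch, and the gradients are averaged before one shared parameter update. A minimal NumPy sketch of that scheme, with a least-squares model as a stand-in for the actual network:

```python
import numpy as np

def worker_gradient(w, X, y):
    # Least-squares gradient on one worker's data shard (a stand-in for
    # the per-GPU backward pass of the real network).
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(512, 8)), rng.normal(size=512)
w, n_workers, lr = np.zeros(8), 4, 0.05

for step in range(100):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [worker_gradient(w, Xi, yi) for Xi, yi in shards]  # concurrent on a cluster
    w -= lr * np.mean(grads, axis=0)  # all-reduce average, then one shared update
```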
GPU-based parallel optimization of immune convolutional neural network and embedded system (2017)
Image recognition systems are used ever more widely, in security monitoring, industrial intelligent monitoring, unmanned vehicles, and even space exploration. In designing an image recognition system, the traditional convolutional neural network has defects such as long training time, easy over-fitting and a high misclassification rate. To overcome these defects, we first used an immune mechanism to improve the convolutional neural network and put forward a novel immune convolutional neural network algorithm, after analyzing the network structure and parameters of the convolutional neural network. Our algorithm not only integrates the location data of the network nodes and the adjustable parameters, but also dynamically adjusts the smoothing factor of the basis function. In addition, we utilized NVIDIA GPUs (Graphics Processing Units) to accelerate the new immune convolutional neural network (ICNN) with parallel computing and built a real-time embedded image recognition system around this ICNN. The immune convolutional neural network algorithm was implemented with CUDA programming and was tested with sample data in a GPU-based environment. The GPU-based implementation uses cuDNN, which was designed by NVIDIA for GPU-based acceleration of DNNs in machine learning. Experimental results show that our new immune convolutional neural network has a higher recognition rate, more stable performance and faster computing speed than the traditional convolutional neural network.
Keywords: Immune algorithm | Convolutional neural network | Image recognition | Parallel computing | Embedded system | Security monitoring
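The abstract does not spell out the immune mechanism, but it does state that the algorithm dynamically adjusts the smoothing factor of a basis function. Purely as an illustration of that one idea, here is a Gaussian basis function with an adjustable smoothing factor; the update heuristic is hypothetical, not the authors' rule:

```python
import numpy as np

def rbf_responses(x, centers, sigma):
    # Gaussian basis-function responses; sigma is the smoothing factor.
    d2 = ((x[None, :] - centers) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def adjust_sigma(responses, sigma, lo=0.05, hi=0.95, rate=1.1):
    # Hypothetical dynamic adjustment: narrow sigma when responses
    # saturate, widen it when they are nearly flat.
    peak = responses.max()
    if peak > hi:
        return sigma / rate
    if peak < lo:
        return sigma * rate
    return sigma
```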
Analysis on spatial-temporal features of taxis emissions from big data informed travel patterns: a case of Shanghai, China (2017)
Air pollution from the transportation sector has become a serious urban environmental problem, especially in developing countries with expanding urbanization. Advancing cleaner technologies and optimally regulating transport behaviors and related infrastructure design are critical to addressing this issue. Understanding the spatial and temporal emission patterns within transportation lays the foundation for better infrastructure design and guidance toward low-carbon transportation behaviors. The availability of Global Positioning System (GPS) data and emerging big data analysis techniques enable in-depth analysis of this topic, yet applications to date have been rather few. Against this background, this paper analyzes taxi energy consumption and emissions and their spatial-temporal distribution in Shanghai, one of the most famous mega-cities in China, by applying big data analysis to GPS data of taxis. Spatial and temporal features of energy consumption and pollutant emissions were further mapped with a geographical information system (GIS). The results highlight that, spatially, energy consumption and emissions present a dual-core cyclic structure in which two hubs were identified: one was the city center, the other the Hongqiao transport hub, and activity and emissions were more concentrated on the west part of the Huangpu River. Temporally, the highest activity and emission period was 9-10 AM, with a second peak at 7-8 PM, both traffic rush periods; the lowest activity/emission period was 3-4 AM. The causal mechanism behind this distribution was further investigated, so as to improve driving behaviors. Through exploring the spatial and temporal emission distribution of taxis via big data techniques, this paper provides enlightening insights for policy makers toward a better understanding of travel patterns and related environmental implications in the Shanghai metropolis, so as to support better planning of infrastructure systems, demand-side management and the promotion of low-carbon lifestyles.
Keywords: GPS | Big data mining | Spatial-temporal emissions distribution | Taxi travel pattern | Shanghai
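As a hedged sketch of the kind of spatio-temporal aggregation described above: GPS segments carrying per-segment fuel estimates are converted to CO2 with an emission factor, then binned by hour and by a coarse lon/lat grid cell for GIS mapping. The column names and the roughly 2.3 kg-CO2-per-litre gasoline factor are illustrative assumptions, not the paper's actual pipeline:

```python
import pandas as pd

# Hypothetical schema: one row per GPS segment of one taxi.
trips = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-03-02 09:15", "2015-03-02 09:40",
                                 "2015-03-02 19:30", "2015-03-03 03:10"]),
    "lon": [121.47, 121.38, 121.49, 121.45],
    "lat": [31.23, 31.20, 31.22, 31.25],
    "fuel_l": [0.8, 1.1, 0.9, 0.3],  # fuel burned on the segment (litres)
})

EF_CO2 = 2.3  # approx. kg CO2 per litre of gasoline (illustrative factor)
trips["co2_kg"] = trips["fuel_l"] * EF_CO2
trips["hour"] = trips["timestamp"].dt.hour
trips["cell"] = (trips["lon"].round(2).astype(str) + ","
                 + trips["lat"].round(2).astype(str))  # crude ~1 km grid

hourly = trips.groupby("hour")["co2_kg"].sum()    # temporal profile (rush-hour peaks)
by_cell = trips.groupby("cell")["co2_kg"].sum()   # spatial layer for GIS mapping
print(hourly, by_cell, sep="\n")
```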
Marcher: A Heterogeneous System Supporting Energy-Aware High Performance Computing and Big Data Analytics (2017)
Excessive energy consumption is a major constraint in designing and deploying the next generation of supercomputers. Minimizing the energy consumption of high performance computing and big data applications requires novel energy-conscious technologies (both hardware and software) at multiple layers, from architecture and system support to applications. In the past decade, we have witnessed significant progress toward developing more energy-efficient hardware and facility infrastructure. However, the energy efficiency of software has not improved much. One obstacle that hinders the exploration of green software technologies is the lack of tools and systems that can provide accurate, fine-grained, and real-time power and energy measurement for technology evaluation and verification. Marcher, a heterogeneous high performance computing infrastructure, is built to fill this gap by supporting research in energy-aware high performance computing and big data analytics. The Marcher system is equipped with Intel Xeon CPUs, Intel Many Integrated Core coprocessors (Xeon Phi), Nvidia GPUs, power-aware memory systems and hybrid storage with Hard Disk Drives (HDDs) and Solid State Disks (SSDs). It provides easy-to-use tools and interfaces for researchers to obtain decomposed and fine-grained power consumption data for these primary computing components. This paper presents the design of the Marcher system and demonstrates the use of Marcher power measurement tools to obtain detailed power consumption data in various research projects.
Keywords: Energy efficient high performance computing | Energy-aware big data analytics | Power-measurable systems | Power profiling
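Marcher's contribution is the fine-grained, per-component power data itself; energy then follows by integrating power over time (E = the integral of P dt). A small NumPy sketch of that post-processing step, with synthetic traces standing in for Marcher's decomposed measurements (the tool's real interface is not reproduced here):

```python
import numpy as np

t = np.arange(0.0, 60.0, 0.1)  # a 60 s run sampled at 10 Hz
samples = {                    # synthetic stand-ins for per-component watts
    "cpu": 95 + 20 * np.sin(0.3 * t),
    "gpu": 150 + 40 * np.abs(np.sin(0.1 * t)),
    "mem": 12 + 2 * np.random.default_rng(1).random(t.size),
}

# Trapezoidal integration of each power trace yields energy in joules.
dt = np.diff(t)
energy_j = {name: float(np.sum(0.5 * (p[1:] + p[:-1]) * dt))
            for name, p in samples.items()}
print(energy_j)
```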
Efficient deep network for vision-based object detection in robotic applications (2017)
Vision-based object detection is essential for a multitude of robotic applications. However, it is also a challenging job due to the diversity of the environments in which such applications are required to operate, and the strict constraints that apply to many robot systems in terms of run-time, power and space. To meet these special requirements of robotic applications, we propose an efficient deep network for vision-based object detection. More specifically, for a given image captured by a robot-mounted camera, we first introduce a novel proposal layer to efficiently generate potential object bounding-boxes. The proposal layer consists of efficient on-line convolutions and effective off-line optimization. Afterwards, we construct a robust detection layer which contains a multiple population genetic algorithm-based convolutional neural network (MPGA-based CNN) module and a TLD-based multi-frame fusion procedure. Unlike most deep learning based approaches, which rely on GPUs, all of the on-line processes in our system are able to run efficiently without GPU support. We perform several experiments to validate each component of the proposed object detection approach and compare it with recently published state-of-the-art object detection algorithms on widely used datasets. The experimental results demonstrate that the proposed network exhibits high efficiency and robustness in object detection tasks.
Keywords: Deep network | Object detection | Computer vision | Robotic application | MPGA
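The proposal layer itself is specific to the paper; as a generic illustration of what such a layer outputs, the sketch below scores sliding windows by gradient energy (a crude stand-in for the learned on-line convolutions) and keeps the top-k boxes:

```python
import numpy as np

def propose_boxes(img, box=32, stride=16, top_k=5):
    # Score each window by its gradient energy and keep the top-k
    # (score, x, y, w, h) tuples as object proposals.
    gy, gx = np.gradient(img.astype(float))
    energy = gx ** 2 + gy ** 2
    boxes = []
    for y in range(0, img.shape[0] - box + 1, stride):
        for x in range(0, img.shape[1] - box + 1, stride):
            boxes.append((energy[y:y + box, x:x + box].sum(), x, y, box, box))
    return sorted(boxes, reverse=True)[:top_k]

img = np.zeros((128, 128)); img[40:80, 50:90] = 1.0  # one bright square
print(propose_boxes(img))  # top proposals cluster around the square's edges
```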
Prototyping a GPGPU Neural Network for Deep-Learning Big Data Analysis (2017)
Big Data concerns large-volume, complex, growing datasets. Given the fast development of data storage and networks, organizations are collecting large, ever-growing datasets that can contain useful information. In order to extract information from these datasets within useful time, it is important to use distributed and parallel algorithms. One common use of big data is machine learning, in which collected data is used to predict future behavior. Deep learning using Artificial Neural Networks is one of the popular methods for extracting information from complex datasets, and is capable of creating more complex models than traditional probabilistic machine learning techniques. This work presents a step-by-step guide on how to prototype a deep-learning application that executes on both GPU and CPU clusters. Python and Redis are the core supporting tools of this guide. This tutorial will allow the reader to understand the basics of building a distributed high performance GPU application in a few hours. Since we do not depend on any deep-learning application or framework (we use low-level building blocks), this tutorial can be adjusted for any other parallel algorithm the reader might want to prototype on Big Data. Finally, we discuss how to move from a prototype to a full-blown production application.
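Consistent with the abstract's "low-level building blocks" approach, the distribution backbone can be as simple as a Redis list used as a work queue: a producer enqueues batch descriptors and each CPU/GPU worker pops and processes them. A minimal sketch with redis-py (the queue name and payload format are my assumptions, not the tutorial's exact code):

```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

# Producer: enqueue mini-batch descriptors for workers to consume.
for batch_id in range(8):
    r.lpush("batches", json.dumps({"id": batch_id, "epoch": 0}))

# Worker loop (run one per CPU/GPU node): blocking-pop a task, process it.
while True:
    item = r.brpop("batches", timeout=1)
    if item is None:
        break                      # queue drained
    task = json.loads(item[1])
    # ... compute gradients for task["id"] on this node's GPU/CPU ...
    print("processed batch", task["id"])
```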
Design and application of a parallel stereo matching algorithm based on CUDA (2016)
To construct accurate topographic information in real time for a six-legged mobile robot, this study proposes a stereo matching algorithm that computes dense pixel disparities using a Bayesian posterior probability model, accelerated by GPU-based parallel processing. The proposed algorithm builds support points in the disparity space to obtain the posterior probability distribution of each pixel, and then substitutes it into a Bayesian posterior probability model to construct a disparity energy function. The disparity value of an unknown pixel is obtained by minimizing this energy function. By performing a consistency check between the left and right images, mismatched pixels can be removed. Based on the disparity values of the support points, disparity filling of mismatched regions is achieved with an adaptive-weight method based on cross propagation, yielding an accurate dense disparity map. Parallel computation is applied at every stage of the proposed algorithm using the Compute Unified Device Architecture (CUDA) to reduce execution time. Experimental results show that the proposed algorithm is robust to varying illumination and to the reconstruction of textured curved surfaces. The algorithm can rapidly match images of various sizes and reconstruct scene disparity maps in real time at a resolution of 640x480. A vision test set with magnification was used to generate disparity maps of large scenes and to validate the practical applicability of the algorithm, with good results.
Keywords: CUDA | Parallel processing | Stereo matching | Binocular vision | Bayes | Adaptive weight
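The Bayesian posterior energy model above is the paper's contribution; the underlying disparity search it accelerates can be illustrated with a much simpler windowed sum-of-absolute-differences, winner-takes-all matcher (a baseline sketch, not the authors' method):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sad_disparity(left, right, max_disp=16, win=5):
    # Per-pixel disparity = the shift that minimizes window-averaged SAD.
    left, right = left.astype(float), right.astype(float)
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        diff = np.abs(left[:, d:] - right[:, :w - d])
        cost[d, :, d:] = uniform_filter(diff, size=win)
    return cost.argmin(axis=0)  # winner-takes-all energy minimization
```

A left-right consistency check, as in the paper, would then discard pixels whose disparities disagree between the two views.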
SIMD parallel MCMC sampling with applications for big-data Bayesian analytics (2015)
The computational intensity and sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing application to big data analytics, necessitate both the identification of potential opportunities to parallelize techniques such as Markov Chain Monte Carlo (MCMC) sampling, and the development of general strategies for mapping such parallel algorithms to modern CPUs in order to push performance up to the compute-based and/or memory-based hardware limits. Two opportunities for Single-Instruction Multiple-Data (SIMD) parallelization of MCMC sampling for probabilistic graphical models are presented. In exchangeable models with many observations, such as Bayesian Generalized Linear Models (GLMs), child-node contributions to the conditional posterior of each node can be calculated concurrently. In undirected graphs with discrete-value nodes, concurrent sampling of conditionally independent nodes can be transformed into a SIMD form. High-performance libraries with multi-threading and vectorization capabilities can be readily applied to such SIMD opportunities to gain decent speedup, while a series of high-level source-code and runtime modifications provide a further performance boost by reducing parallelization overhead and increasing data locality for Non-Uniform Memory Access architectures. For big-data Bayesian GLM graphs, the end result is a routine for evaluating the conditional posterior and its gradient vector that is 5 times faster than a naive implementation using (built-in) multi-threaded Intel MKL BLAS, and reaches within striking distance of the memory-bandwidth-induced hardware limit. Using multi-threading for cache-friendly, fine-grained parallelization can outperform coarse-grained alternatives, which are often less cache-friendly, a likely scenario in modern predictive analytics workflows such as hierarchical Bayesian GLM, variable selection, and ensemble regression and classification. The proposed optimization strategies improve the scaling of performance with the number of cores and the width of vector units (applicable to many-core SIMD processors such as Intel Xeon Phi and Graphics Processing Units), resulting in cost-effectiveness, energy efficiency ('green computing'), and higher speed on multi-core x86 processors.
Keywords: GPU | Hierarchical Bayesian | Intel Xeon Phi | Logistic regression | OpenMP | Vectorization
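For the Bayesian GLM case, the routine being optimized evaluates the conditional log-posterior and its gradient, where each observation (child node) contributes an independent term; the sums therefore reduce to vectorized, SIMD-friendly operations. A NumPy sketch for logistic regression (the Gaussian prior here is an assumption for concreteness):

```python
import numpy as np

def log_posterior_and_grad(beta, X, y, tau=10.0):
    # Logistic-GLM conditional log-posterior with Gaussian prior N(0, tau^2 I).
    # Per-observation terms are independent, so both the log-posterior sum
    # and the gradient reduce to vectorized (SIMD-friendly) operations.
    eta = X @ beta                                   # linear predictor
    logp = float(y @ eta - np.logaddexp(0.0, eta).sum()
                 - beta @ beta / (2 * tau ** 2))
    mu = 1.0 / (1.0 + np.exp(-eta))                  # sigmoid mean
    grad = X.T @ (y - mu) - beta / tau ** 2
    return logp, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = (rng.random(1000) < 1 / (1 + np.exp(-(X @ beta_true)))).astype(float)
print(log_posterior_and_grad(np.zeros(5), X, y))
```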
An MPI–CUDA library for image processing on HPC architectures (2015)
Scientific image processing is a topic of interest for a broad scientific community, since it is a means of gaining understanding and insight into data for a growing number of applications. Furthermore, technological evolution permits large data acquisition with sophisticated instruments, and their elaboration through complex multidisciplinary applications, resulting in datasets that are growing at an extremely rapid pace. This creates a need for huge computational power, making it necessary to move towards High Performance Computing (HPC) and to develop proper parallel implementations of image processing algorithms/operations. Modern HPC resources are typically highly heterogeneous systems, composed of multiple CPUs and accelerators such as Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs). The actual barrier posed by heterogeneous HPC resources is the development and/or performance-efficient porting of software onto such complex architectures. In this context, the aim of this work is to enable image processing on clusters of GPUs through the use of PIMA(GE)2 Lib, the Parallel IMAGE processing GEnoa Library. The library is able to exploit traditional clusters through MPI and GPU devices through CUDA, and a first experiment explores the use of GPU clusters. Library operations are provided to users through a sequential interface defined to hide the parallelism of the computation. The parallel computation at each level is managed with specific policies designed to suitably coordinate the parallel processes/threads involved in the elaboration, and their use is tightly coupled with the PIMA(GE)2 Lib interface. In this paper, we present the incremental approach adopted in developing the library and the performance gains of each implementation: a nearly linear speedup is achieved on the cluster architecture, about a 30% improvement in execution time on a single GPU, and the first results on clusters of GPUs are promising.
Keywords: Image processing | Parallel computing | GPU
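As an illustration of the MPI layer and of hiding parallelism behind a sequential-looking call (not PIMA(GE)2 Lib's actual interface), the mpi4py sketch below scatters row blocks of an image, applies an operation on every rank, and gathers the result on the root; run with e.g. `mpirun -n 4 python invert.py`:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def parallel_invert(img):
    # Sequential-looking entry point; the parallelism is hidden inside.
    blocks = np.array_split(img, size) if rank == 0 else None
    block = comm.scatter(blocks, root=0)   # each rank receives one row block
    block = 255 - block                    # the "image operation" (inversion)
    gathered = comm.gather(block, root=0)
    return np.vstack(gathered) if rank == 0 else None

img = np.arange(64 * 64, dtype=np.uint8).reshape(64, 64) if rank == 0 else None
out = parallel_invert(img)
if rank == 0:
    print(out.shape)  # (64, 64): the reassembled, processed image
```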