Funding: the National Key Research and Development Program of China (No. 2017YFA0700900, 2017YFA0700902, 2017YFA0700901, 2017YFB1003101); the National Natural Science Foundation of China (No. 61472396, 61432016, 61473275, 61522211, 61532016, 61521092, 61502446, 61672491, 61602441, 61602446, 61732002, 61702478, 61732020); Beijing Natural Science Foundation (No. JQ18013); the National Basic Research Program of China (No. 2015CB358800); National Science and Technology Major Project (No. 2018ZX01031102); the Transformation and Transfer of Scientific and Technological Achievements of Chinese Academy of Sciences (No. KFJ-HGZX-013); Strategic Priority Research Program of Chinese Academy of Sciences (No. XDB32050200).
Abstract: Deep learning algorithms are the basis of many artificial intelligence applications. These algorithms are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems. Thus, various deep learning accelerators (DLAs) have been proposed and applied to achieve better performance and lower power consumption. However, most deep learning accelerators are unable to support multiple data formats. This research proposes MW-DLA, a deep learning accelerator supporting dynamically configurable data widths. This work analyzes the data distribution of different data types in different layers and trains a typical network with per-layer representations. As a result, the proposed MW-DLA achieves 2X performance and reduces the memory requirement by more than 50% for AlexNet, with less than 5.77% area overhead.
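The per-layer representation described above can be pictured with a small sketch: given the value distribution of a layer's weights or activations, one picks the narrowest fixed-point width that still covers almost all of the observed range, so different layers end up with different widths. The snippet below is only an illustration of that idea; the function name, the candidate widths and the 99.9th-percentile criterion are assumptions, not taken from the MW-DLA design.

```python
import numpy as np

def select_layer_bitwidth(values, coverage=0.999, candidates=(4, 8, 16)):
    """Pick the narrowest signed fixed-point width whose representable
    range covers `coverage` of the observed values (illustrative only)."""
    limit = np.quantile(np.abs(values), coverage)
    for bits in candidates:
        # Largest magnitude representable by a signed integer of `bits` bits,
        # assuming a unit scale; a real accelerator would also choose a scale factor.
        if limit <= 2 ** (bits - 1) - 1:
            return bits
    return max(candidates)

# Hypothetical per-layer statistics: some layers need wider formats than
# others, which is what motivates a configurable data-width DLA.
layer_stats = {
    "conv1": np.random.normal(0, 40, 10000),
    "conv5": np.random.normal(0, 3, 10000),
}
for name, vals in layer_stats.items():
    print(name, select_layer_bitwidth(vals), "bits")
```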
Funding: the National Key Research and Development Program of China (No. 2017YFA0700902, 2017YFB1003101); the National Natural Science Foundation of China (No. 61472396, 61432016, 61473275, 61522211, 61532016, 61521092, 61502446, 61672491, 61602441, 61602446, 61732002, 61702478); the 973 Program of China (No. 2015CB358800); National Science and Technology Major Project (No. 2018ZX01031102); the Transformation and Transfer of Scientific and Technological Achievements of Chinese Academy of Sciences (No. KFJ-HGZX-013); Strategic Priority Research Program of Chinese Academy of Sciences (No. XDBS01050200).
Abstract: In recent years, neural networks (NNs) have received increasing attention from both academia and industry. The significant diversity among existing NNs, as well as among their hardware platforms, makes NN programming a daunting task. In this paper, a domain-specific language (DSL) for NNs, the neural network language (NNL), is proposed to deliver productivity of NN programming and portable performance of NN execution on different hardware platforms. The productivity and flexibility of NN programming are enabled by abstracting NNs as a directed graph of blocks. Four representative and widely used NNs are described with the language and run on 3 different hardware platforms (CPU, GPU and an NN accelerator). Experimental results show that NNs written with the proposed language perform, on average, 14.5% better than the baseline implementations across these 3 platforms. Moreover, compared with the Caffe framework, which specifically targets the GPU platform, the code can achieve similar performance.
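The "directed graph of blocks" abstraction can be sketched in plain Python: each block names an operator and its attributes, edges connect block outputs to block inputs, and mapping the graph onto a concrete backend (CPU, GPU or accelerator) is left to a separate lowering step. The API below is purely hypothetical and only illustrates the idea; it is not the actual NNL syntax.

```python
class Block:
    """A node in the network graph: an operator plus its attributes."""
    def __init__(self, name, op, **attrs):
        self.name, self.op, self.attrs = name, op, attrs
        self.inputs = []

class Graph:
    """A directed graph of blocks; a backend would traverse it in topological order."""
    def __init__(self):
        self.blocks = []

    def add(self, name, op, inputs=(), **attrs):
        b = Block(name, op, **attrs)
        b.inputs = list(inputs)
        self.blocks.append(b)
        return b

# Hypothetical description of a tiny convolutional network.
g = Graph()
data = g.add("data", "input", shape=(1, 3, 224, 224))
conv = g.add("conv1", "conv2d", inputs=[data], channels=64, kernel=3)
relu = g.add("relu1", "relu", inputs=[conv])
fc   = g.add("fc1", "dense", inputs=[relu], units=10)

for b in g.blocks:
    print(b.name, "<-", [i.name for i in b.inputs])
```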
Funding: Supported by the National Key Research and Development Program of China (No. 2017YFA0700902, 2017YFB1003101), the 973 Program of China (No. 2015CB358800), and National Science and Technology Major Project (No. 2018ZX01031102).
Abstract: Deep learning accelerators (DLAs) have proven to be efficient computational devices for processing deep learning algorithms. Various DLA architectures have been proposed and applied to different applications and tasks. However, for most DLAs, the programming interfaces are either difficult to use or not efficient enough. Most DLAs require programmers to write instructions directly, which is time-consuming and error-prone. Another prevailing programming interface for DLAs is high-performance libraries and deep learning frameworks, which are easy to use and friendly to users, but their high abstraction level limits their control over the hardware resources and thus compromises the efficiency of the accelerator. This work presents a design of the programming interface for DLAs. First, various existing DLAs and their programming methods are analyzed, and a methodology for designing programming interfaces for DLAs is proposed, consisting of a high-level assembly language (called DLA-AL), an assembler and a runtime for DLAs. DLA-AL is composed of a low-level assembly language and a set of high-level blocks. It allows experienced experts to fully exploit the potential of DLAs and achieve near-optimal performance. Meanwhile, by using DLA-AL, end-users who have little knowledge of the hardware are able to develop deep learning algorithms on DLAs with minimal programming effort.
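The two-level structure, a low-level assembly plus high-level blocks, can be sketched as a small expansion step: a high-level block such as a convolution is expanded into a sequence of low-level load/compute/store instructions that an assembler would then encode. The instruction names and operands below are invented for illustration and do not correspond to the real DLA-AL instruction set.

```python
def expand_conv_block(in_addr, w_addr, out_addr, tiles):
    """Expand a hypothetical high-level CONV block into low-level instructions.

    Experienced users could write the low-level sequence directly for finer
    control; end-users would only ever see the high-level block.
    """
    instrs = []
    for t in range(tiles):
        instrs.append(("LOAD",  "ibuf", in_addr + t))    # fetch one input tile
        instrs.append(("LOAD",  "wbuf", w_addr))          # fetch (reused) weights
        instrs.append(("CONV",  "ibuf", "wbuf", "obuf"))  # compute on the tile
        instrs.append(("STORE", "obuf", out_addr + t))    # write the result tile
    return instrs

for instr in expand_conv_block(in_addr=0x1000, w_addr=0x8000, out_addr=0x2000, tiles=2):
    print(instr)
```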
Funding: Supported by the National Key Research and Development Program of China (No. 2017YFB1003101, 2018AAA0103300, 2017YFA0700900), the National Natural Science Foundation of China (No. 61702478, 61732007, 61906179), the Beijing Natural Science Foundation (No. JQ18013), the National Science and Technology Major Project (No. 2018ZX01031102), and the Beijing Academy of Artificial Intelligence.
Abstract: Deep learning has now been widely used in intelligent apps on mobile devices. In pursuit of ultra-low power and latency, integrating neural network accelerators (NNAs) into mobile phones has become a trend. However, conventional deep learning programming frameworks are not well developed to support such devices, leading to low computing efficiency and high memory occupation. To address this problem, a 2-stage pipeline is proposed for optimizing deep learning model inference on mobile devices with NNAs in terms of both speed and memory footprint. The 1st stage reduces the computation workload via graph optimization, including splitting and merging nodes. The 2nd stage goes further by optimizing at the compilation level, including kernel fusion and in-advance compilation. The proposed optimizations are evaluated on a commercial mobile phone with an NNA. The experimental results show that the proposed approaches achieve 2.8× to 26× speedup and reduce the memory footprint by up to 75%.
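The node merging in the 1st stage can be illustrated with a toy pass that fuses an element-wise activation into the preceding convolution, so the fused pair runs as a single kernel and the intermediate tensor never has to be materialized, which is where both the speedup and the memory-footprint saving come from. The pass below operates on a hypothetical list-of-nodes representation and is only a sketch of the idea, not the paper's actual optimizer.

```python
def fuse_conv_activation(nodes):
    """Merge ('conv', 'relu') pairs into a single 'conv_relu' node (illustrative)."""
    fused, i = [], 0
    while i < len(nodes):
        op, name = nodes[i]
        if op == "conv" and i + 1 < len(nodes) and nodes[i + 1][0] == "relu":
            # One fused kernel: the conv output stays on-chip instead of being
            # written to memory and read back by the activation.
            fused.append(("conv_relu", name + "+" + nodes[i + 1][1]))
            i += 2
        else:
            fused.append(nodes[i])
            i += 1
    return fused

graph = [("conv", "conv1"), ("relu", "relu1"), ("pool", "pool1"),
         ("conv", "conv2"), ("relu", "relu2")]
print(fuse_conv_activation(graph))
```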
Funding: This work is partially supported by the National Key Research and Development Program of China under Grant No. 2017YFB1003101, the National Natural Science Foundation of China under Grant Nos. 61472396, 61432016, 61473275, 61522211, 61532016, 61521092, 61502446, 61672491, 61602441, 61602446, 61732002, and 61702478, Beijing Science and Technology Projects under Grant No. Z151100000915072, the Science and Technology Service Network Initiative (STS) Projects of Chinese Academy of Sciences, and the National Basic Research 973 Program of China under Grant No. 2015CB358800.
Abstract: The increasing attention on deep learning has tremendously spurred the design of intelligence processing hardware. The variety of emerging intelligence processors requires standard benchmarks for fair comparison and system optimization (in both software and hardware). However, existing benchmarks are unsuitable for benchmarking intelligence processors due to their lack of diversity and representativeness. Also, the lack of a standard benchmarking methodology further exacerbates this problem. In this paper, we propose BENCHIP, a benchmark suite and benchmarking methodology for intelligence processors. The benchmark suite in BENCHIP consists of two sets of benchmarks: microbenchmarks and macrobenchmarks. The microbenchmarks consist of single-layer networks; they are mainly designed for bottleneck analysis and system optimization. The macrobenchmarks contain state-of-the-art industrial networks, so as to offer a realistic comparison of different platforms. We also propose a standard benchmarking methodology built upon an industrial software stack and evaluation metrics that comprehensively reflect various characteristics of the evaluated intelligence processors. BENCHIP is utilized for evaluating various hardware platforms, including CPUs, GPUs, and accelerators. BENCHIP will be open-sourced soon.
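The split between microbenchmarks and macrobenchmarks can be pictured as a small benchmark registry: single-layer workloads are swept for bottleneck analysis, while full industrial networks give the end-to-end comparison across platforms. The layer list, network names and timing harness below are hypothetical placeholders, not the actual BENCHIP contents.

```python
import time

# Hypothetical benchmark suite: single layers vs. whole networks.
MICROBENCHMARKS = ["conv3x3", "fully_connected", "pooling", "lstm_cell"]
MACROBENCHMARKS = ["alexnet", "resnet50", "faster_rcnn"]

def run_workload(platform, workload):
    """Stand-in for running one workload on one platform and timing it."""
    start = time.perf_counter()
    # ... invoke the platform's software stack here ...
    return time.perf_counter() - start

def benchmark(platform):
    report = {}
    # Microbenchmarks: per-layer numbers expose bottlenecks on the platform.
    report["micro"] = {w: run_workload(platform, w) for w in MICROBENCHMARKS}
    # Macrobenchmarks: full networks give a realistic cross-platform comparison.
    report["macro"] = {w: run_workload(platform, w) for w in MACROBENCHMARKS}
    return report

for platform in ["cpu", "gpu", "accelerator"]:
    print(platform, benchmark(platform))
```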