Automatic generation of specialized direct convolutions for mobile GPUs
2020
Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit
Using Lift, we show that it is possible to automatically generate code that is 10× faster than the direct convolution while using 3.6× less space than the GEMM-based convolution of the very specialized ...
ARM Compute Library on the latest generation of ARM Mali GPU. ...
Acknowledgments This work was supported by the Engineering and Physical Sciences Research Council (grant EP/L01503X/1), EPSRC Centre for Doctoral Training in Pervasive Parallelism at the University of ...
doi:10.1145/3366428.3380771
dblp:conf/ppopp/MogersRLTOD20
fatcat:342savoeijb3zaznujfmhptoku
Design Automation for Efficient Deep Learning Computing
[article]
2019
arXiv
pre-print
We propose design automation techniques for efficient neural networks. We investigate automatically designing specialized fast models, auto channel pruning, and auto mixed-precision quantization. ...
Moreover, we shorten the design cycle by 200× compared with previous work, so that we can afford to design specialized neural network models for different hardware platforms. ...
Compared with general-purpose models, our specialized models improve the top-1 accuracy by 1.1%–3.1% while being 1.2×–7.5× faster. Table 2 compares the specialized models on CPU/GPU/Mobile. ...
arXiv:1904.10616v1
fatcat:77ft4alwqvgszhevtcjssnkyzm
Latency Estimation Tool and Investigation of Neural Networks Inference on Mobile GPU
2021
Computers
We experimentally demonstrate the applicability of such an approach on a subset of the popular NAS-Benchmark 101 dataset for two different mobile GPUs. ...
Many deep learning applications are intended to run on mobile devices, and both accuracy and inference time matter for many of them. ...
Acknowledgments: We thank the editor and three anonymous reviewers for their constructive comments, which helped us to improve the manuscript. ...
doi:10.3390/computers10080104
fatcat:4titj4ftlvfdxkcrdgn3td7um4
Deep Learning for Mobile Multimedia
2017
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
DL architectures and algorithms are hardly adapted to the storage and computation resources of a mobile device. Therefore, there is a need for new generations of mobile processors and chipsets, small footprint ...
Specifically, in recent years powerful and compact GPUs have been released at affordable prices, which allow accelerating the computation of the weights of DNNs. ...
special hardware platforms for mobile DNNs. ...
doi:10.1145/3092831
fatcat:ez2fcgckhjawlfywyecest4jqy
Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs
2019
2019 IEEE International Symposium on Workload Characterization (IISWC)
Mobile and embedded systems now have GPUs which are ideal for the parallel computations of neural networks and for their lower energy cost per operation. ...
We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to a 2× slowdown. ...
Second, designing new neural network architectures for specific devices should consider the best sizes of convolutional layers for each library and hardware, thus building specialized networks for each ...
doi:10.1109/iiswc47752.2019.9042000
dblp:conf/iiswc/RaduKWTCCFSO19
fatcat:hvo6ll2esndyzg7sfnv5ujbe2u
Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs
[article]
2020
arXiv
pre-print
Mobile and embedded systems now have GPUs which are ideal for the parallel computations of neural networks and for their lower energy cost per operation. ...
We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to a 2× slowdown. ...
Second, designing new neural network architectures for specific devices should consider the best sizes of convolutional layers for each library and hardware, thus building specialized networks for each ...
arXiv:2002.08697v1
fatcat:eii47oyijfgkbfuivmmfx2xmd4
Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications
2022
ACM Transactions on Design Automation of Electronic Systems
Apart from general acceleration techniques, we also showcase several task-specific accelerations for point cloud, video, and natural language processing by exploiting their spatial sparsity and temporal ...
To reduce the large design cost of these manual solutions, we discuss the AutoML framework for each of them, such as neural architecture search (NAS) and automated pruning and quantization. ...
In practice, depthwise convolution is usually used for edge devices (e.g., mobile), while group/normal convolution is usually used for cloud devices (e.g., GPU). ...
doi:10.1145/3486618
fatcat:h6xwv2slo5eklift2fl24usine
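The snippet above contrasts depthwise convolution (edge devices) with group/normal convolution (cloud devices). The efficiency gap it alludes to can be illustrated by counting parameters; the following is a minimal sketch in plain Python (the function name and the example channel/kernel sizes are illustrative, not from any of the papers listed):

```python
def conv2d_params(c_in, c_out, k, groups=1):
    """Weights + biases of a 2D convolution with a square k x k kernel.

    Each of the c_out filters sees only c_in // groups input channels,
    so increasing `groups` shrinks the weight tensor proportionally.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * k * k * c_out + c_out

c = 64  # illustrative channel count
normal = conv2d_params(c, c, 3)              # groups=1: 36,928 params
grouped = conv2d_params(c, c, 3, groups=8)   # 8 groups:  4,672 params
depthwise = conv2d_params(c, c, 3, groups=c) # depthwise:   640 params
print(normal, grouped, depthwise)
```

For 64 channels and a 3×3 kernel, the depthwise variant uses roughly 58× fewer parameters than the normal convolution, which is why it dominates mobile-oriented architectures.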
Neural Architecture Search Survey: A Hardware Perspective
2022
ACM Computing Surveys
The goal of this paper is to provide insights and understanding of HW-NAS techniques for various hardware platforms (MCU, CPU, GPU, ASIC, FPGA, ReRAM, DSP, and VPU), followed by the co-search methodologies ...
At the same time, several hardware platforms, general- and special-purpose, have equally contributed to the training and deployment of these complex networks in a different setting. ...
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s). ...
doi:10.1145/3524500
fatcat:4ibnwmgbdnbhjpk4u7soc6aom4
Latency and Throughput Characterization of Convolutional Neural Networks for Mobile Computer Vision
[article]
2018
arXiv
pre-print
We study performance characteristics of convolutional neural networks (CNN) for mobile computer vision systems. CNNs have proven to be a powerful and efficient approach to implement such systems. ...
Our measurements include embedded processors found on mobile devices and high-performance processors that can be used on the network side of mobile systems. ...
Some research prototypes that leverage mobile device special-purpose processors (e.g., DSP, GPU) also exist [13, 15–18]. ...
arXiv:1803.09492v1
fatcat:akf3qn7p5vdtxppjoitajrg6ri
A Metaprogramming and Autotuning Framework for Deploying Deep Learning Applications
[article]
2016
arXiv
pre-print
However, on other hardware targets, especially mobile GPUs, such vendor libraries are not generally available. ...
Thus, the development of portable, open, high-performance, energy-efficient GPU code for DNN operations would enable broader deployment of DNN-based algorithms. ...
Tuning for Qualcomm Mobile GPUs. In Figure 11, the boda-initial values show the initial (poor) performance when running the general-case fallback convolution variant on the SD820 platform. ...
arXiv:1611.06945v1
fatcat:clgpegm2ubd6lowwclnheqjf7q
CloudifierNet – Deep Vision Models for Artificial Image Processing
[article]
2019
arXiv
pre-print
Computer vision models and particularly deep directed acyclic graphs based on convolutional modules are generally constructed and trained based on natural images datasets. ...
In the current paper, we will present the base principles of a deep neural pipeline for computer vision applied to artificial scenes (scenes generated by user interfaces or similar). ...
for automatic code generation based on (near) natural language specifications up to source code generation based on an interface mock-up (computer-aided drawing of user-interface mock-up). ...
arXiv:1911.01346v1
fatcat:zmbuoiwnrfcsvbhbqkson4nkxy
RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices
[article]
2021
arXiv
pre-print
However, the direct generalization of existing 2D CNN weight pruning methods to 3D CNNs is not ideal for fully exploiting mobile parallelism while achieving high inference accuracy. ...
Mobile devices are becoming an important carrier for deep learning tasks, as they are being equipped with powerful, high-end mobile CPUs and GPUs. ...
Consider a general 3D CNN consisting of L convolutional (CONV) layers. Besides the l-th CONV layer weight tensor W l , the bias is denoted by b l . ...
arXiv:2007.09835v2
fatcat:qsyhrk6hhvcjfc2tcxyxoqupya
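The RT3D snippet above introduces the weight tensor W_l and bias b_l of the l-th 3D CONV layer. As a rough aid to that notation, the sketch below (assumed shapes, not taken from the paper) computes the size of W_l plus b_l and the output spatial shape of one such layer:

```python
def conv3d_layer(c_in, c_out, k, d, h, w, stride=1, pad=0):
    """For one 3D CONV layer with weight tensor W_l of shape
    (c_out, c_in, k, k, k) and bias b_l of shape (c_out,), return
    (parameter count, output (depth, height, width))."""
    params = c_out * c_in * k ** 3 + c_out  # |W_l| + |b_l|
    out = tuple((s + 2 * pad - k) // stride + 1 for s in (d, h, w))
    return params, out

# Illustrative first layer: 3 input channels, 64 filters, 3x3x3 kernel,
# a 16-frame 112x112 clip, "same" padding of 1.
p, out = conv3d_layer(3, 64, 3, 16, 112, 112, pad=1)
print(p, out)  # 5248 (16, 112, 112)
```

The cubic k**3 term is what makes 3D CNN weights so much larger than their 2D counterparts, which is the pruning opportunity the paper targets.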
ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design
[chapter]
2018
Lecture Notes in Computer Science
Based on a series of controlled experiments, this work derives several practical guidelines for efficient network design. Accordingly, a new architecture is presented, called ShuffleNet V2. ...
Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. ...
Acknowledgements Thanks Yichen Wei for his help on paper writing. This research is partially supported by National Natural Science Foundation of China (Grant No. 61773229). ...
doi:10.1007/978-3-030-01264-9_8
fatcat:5eljnbtc65blveoza4nm5k6gbi
MNN: A Universal and Efficient Inference Engine
[article]
2020
arXiv
pre-print
To deal with these challenges, we propose Mobile Neural Network (MNN), a universal and efficient inference engine tailored to mobile applications. ...
Deploying deep learning models on mobile devices has drawn increasing attention recently. ...
ACKNOWLEDGEMENTS We thank Chaoyue Niu for helpful discussions and the anonymous reviewers for their valuable comments to improve our work. ...
arXiv:2002.12418v1
fatcat:ppeykiv57nc6bfqa74lyzse3by
MobiSR
2019
The 25th Annual International Conference on Mobile Computing and Networking - MobiCom '19
In recent years, convolutional networks have demonstrated unprecedented performance in the image restoration task of super-resolution (SR). ...
SR entails the upscaling of a single low-resolution image in order to meet application-specific image quality demands and plays a key role in mobile devices. ...
In general, CE can represent a diversity of mobile SoCs hosting heterogeneous compute engines, ranging from the ubiquitous mobile CPUs and GPUs to the newer emerging NPUs [26] . ...
doi:10.1145/3300061.3345455
dblp:conf/mobicom/LeeVDBL19
fatcat:k52pugz3tvc3jjky3cmb4d3t7m
Showing results 1 — 15 out of 5,390 results