Exploiting Parallelism Opportunities with Deep Learning Frameworks

Authors:
Yu Emma Wang (Harvard University, Cambridge, MA)
Carole-Jean Wu (Facebook AI)
Xiaodong Wang (Facebook AI)
Kim Hazelwood (Facebook AI)
David Brooks (Harvard University, Cambridge, MA)

ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 1, Article No. 9, pp. 1–23. https://doi.org/10.1145/3431388

Published: 30 December 2020

Abstract

State-of-the-art machine learning frameworks support a wide variety of design features to enable a flexible machine learning programming interface and to ease the programmability burden on machine learning developers. Identifying and using a performance-optimal setting in feature-rich frameworks, however, involves a non-trivial amount of performance profiling effort and often relies on domain-specific knowledge. This article takes a deep dive into analyzing the performance impact of key design features in a machine learning framework and quantifies the role of parallelism. The observations and insights distill into a simple set of guidelines that one can use to achieve much higher training and inference speedup. Across a diverse set of real-world deep learning models, the evaluation results show that the proposed performance tuning guidelines outperform the Intel and TensorFlow recommended settings by 1.30× and 1.38×, respectively.

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Neural networks
2. Computing methodologies
  1. Machine learning
    1. Machine learning approaches

Published in
ACM Transactions on Architecture and Code Optimization Volume 18, Issue 1
March 2021
402 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3446348
Editor: David Kaeli, Northeastern University, USA
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 30 December 2020
- Accepted: 1 October 2020
- Revised: 1 September 2020
- Received: 1 May 2020
Author Tags
Machine learning frameworks
parallel computing
performance analysis
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 14
- Total Downloads: 1,388
- Downloads (Last 12 months): 332
- Downloads (Last 6 weeks): 55