
Exploiting Parallelism Opportunities with Deep Learning Frameworks

Published: 30 December 2020

Abstract

State-of-the-art machine learning frameworks support a wide variety of design features to enable a flexible machine learning programming interface and to ease the programmability burden on machine learning developers. Identifying and using a performance-optimal setting in feature-rich frameworks, however, involves a non-trivial amount of performance profiling effort and often relies on domain-specific knowledge. This article takes a deep dive into analyzing the performance impact of key design features in a machine learning framework and quantifies the role of parallelism. The observations and insights are distilled into a simple set of guidelines that developers can use to achieve substantial training and inference speedups. Across a diverse set of real-world deep learning models, the evaluation results show that the proposed performance tuning guidelines outperform the settings recommended by Intel and TensorFlow by 1.30× and 1.38×, respectively.
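For readers unfamiliar with the tuning space the abstract alludes to, the key framework-level parallelism knobs are intra-operator parallelism (threads working inside a single operator, such as one matrix multiplication) and inter-operator parallelism (independent operators scheduled concurrently). The sketch below shows how such knobs are commonly exposed in TensorFlow 2.x, alongside the Intel OpenMP environment variables that public Intel guidance relies on; the specific values are illustrative assumptions only, not the guidelines proposed in this article.

```python
# Minimal sketch of common CPU threading knobs (TensorFlow 2.x assumed).
# The values below are placeholders, NOT this article's recommended settings.
import os

# OpenMP / oneDNN environment variables must be set before TensorFlow is imported.
num_physical_cores = os.cpu_count() // 2  # assumes 2-way SMT; adjust for your machine
os.environ.setdefault("OMP_NUM_THREADS", str(num_physical_cores))
os.environ.setdefault("KMP_BLOCKTIME", "1")  # ms a worker spins before sleeping
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")

import tensorflow as tf

# Intra-op: thread pool used inside a single operator (e.g., one conv or matmul).
tf.config.threading.set_intra_op_parallelism_threads(num_physical_cores)
# Inter-op: how many independent operators may run concurrently.
tf.config.threading.set_inter_op_parallelism_threads(2)
```

Which values work best depends on the model and the machine; choosing them well is exactly the profiling burden the article aims to reduce with its guidelines.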

References

  1. Facebook. 2019. Folly: Facebook Open-source Library. Retrieved from https://github.com/facebook/folGoogle ScholarGoogle Scholar
  2. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’16), Vol. 16. 265--283.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. P. Anju. 2018. Tips to improve performance for popular deep learning frameworks on CPUs. Intel Dev.Zone (2018). https://software.intel.com/content/www/us/en/develop/articles/tips-to-improve-performance-for-popular-deep-learning-frameworks-on-multi-core-cpus.html.Google ScholarGoogle Scholar
  4. Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, and Mohak Shah. 2015. Comparative study of Caffe, Neon, Theano, and Torch for deep learning. arXiv preprint arXiv:1511.06435.Google ScholarGoogle Scholar
  5. Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. 2017. Julia: A fresh approach to numerical computing. SIAM Rev. 59, 1 (2017), 65--98.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Ashraf Bhuiyan, Mahmoud Abuzaina, Niranjan Hasabnis, Niroop Ammbashankar, Faijul Amin, Sheng Fu, and Bhavani Subramanian. [n.d.]. Improving TensorFlow inference performance on Intel Xeon processors. Intel AI Blog ([n.d.]). intel.com. https://www.intel.com/content/www/us/en/artificial-intelligence/posts/improving-tensorflow-inference-performance-on-intel-xeon-processors.html.Google ScholarGoogle Scholar
  7. Google AI Blog. 2019. Introducing GPipe, an open source library for efficiently training large-scale neural network models. Retrieved from https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html.Google ScholarGoogle Scholar
  8. Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).Google ScholarGoogle Scholar
  9. Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze et al. 2018. TVM: An Automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18). 578--594.Google ScholarGoogle Scholar
  10. Heng-Tze Cheng. 2016. Wide and deep learning: Better together with TensorFlow. Google AI Blog. Retrieved from https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html.Google ScholarGoogle Scholar
  11. Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191--198.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248--255.Google ScholarGoogle ScholarCross RefCross Ref
  13. Eigen. 2019. Eigen thread pool. (2019). Retrieved from https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/ThreadPool/.Google ScholarGoogle Scholar
  14. Assaf Eisenman, Maxim Naumov, Darryl Gardner, Misha Smelyanskiy, Sergey Pupyrev, Kim Hazelwood, Asaf Cidon, and Sachin Katti. 2018. Bandana: Using non-volatile memory for storing deep learning models. arXiv preprint arXiv:1811.05922 (2018).Google ScholarGoogle Scholar
  15. Google. 2019. TensorFlow Performance Guide. Retrieved from https://docs.w3cub.com/tensorflow~guide/performance/performance_guide/#general_best_practices.Google ScholarGoogle Scholar
  16. Udit Gupta, Xiaodong Wang, Maxim Naumov, Carole-Jean Wu, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Bill Jia, Hsien-Hsin S. Lee, Andrey Malevich, Dheevatsa Mudigere, Mikhail Smelyanskiy, Liang Xiong, and Xuan Zhang. 2019. The architectural implications of Facebook’s DNN-based personalized recommendation. arXiv preprint arXiv:1906.03109 (2019).Google ScholarGoogle Scholar
  17. Niranjan Hasabnis. 2018. Auto-tuning TensorFlow threading model for CPU backend. arXiv preprint arXiv:1812.01665 (2018).Google ScholarGoogle Scholar
  18. Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, et al. 2018. Applied machine learning at Facebook: A datacenter infrastructure perspective. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’18). IEEE, 620--629.Google ScholarGoogle ScholarCross RefCross Ref
  19. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.Google ScholarGoogle ScholarCross RefCross Ref
  20. Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173--182.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700--4708.Google ScholarGoogle Scholar
  22. Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).Google ScholarGoogle Scholar
  23. Arpan Jain, Ammar Ahmad Awan, Quentin Anthony, Hari Subramoni, and Dhableswar K. D. K. Panda. 2019. Performance characterization of DNN training using Tensorflow and PyTorch on modern clusters. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER’19). IEEE, 1--11.Google ScholarGoogle Scholar
  24. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 675--678.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Nikhil Ketkar. 2017. Introduction to PyTorch. In Deep Learning with Python. Springer, 195--208.Google ScholarGoogle Scholar
  26. Primate Labs. 2019. GeekBench v4. Retrieved from https://www.geekbench.com/.Google ScholarGoogle Scholar
  27. Chris Lomont. 2011. Introduction to Intel advanced vector extensions. Intel White Paper (2011), 1--21. https://software.intel.com/content/dam/develop/external/us/en/documents/intro-to-intel-avx-183287.pdf.Google ScholarGoogle Scholar
  28. P. Mattson, V. J. Reddi, C. Cheng, C. Coleman, G. Diamos, D. Kanter, P. Micikevicius, D. Patterson, G. Schmuelling, H. Tang, G. Wei, and Carole-Jean Wu. 2020. MLPerf: An industry standard benchmark suite for machine learning performance. IEEE Micro 40, 2 (2020), 8--16.Google ScholarGoogle ScholarCross RefCross Ref
  29. Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. 2019. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091 (2019).Google ScholarGoogle Scholar
  30. Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, et al. 2019. MLPerf inference benchmark. arXiv preprint arXiv:1911.02549 (2019).Google ScholarGoogle Scholar
  31. Shaohuai Shi, Qiang Wang, Pengfei Xu, and Xiaowen Chu. 2016. Benchmarking state-of-the-art deep learning software tools. In Proceedings of the 7th International Conference on Cloud Computing and Big Data (CCBD’16). IEEE, 99--104.Google ScholarGoogle ScholarCross RefCross Ref
  32. Akshitha Sriraman, Abhishek Dhanotia, and Thomas F. Wenisch. 2019. SoftSKU: Optimizing server architectures for microservice diversity at scale. In Proceedings of the 46th International Symposium on Computer Architecture. ACM, 513--526.Google ScholarGoogle Scholar
  33. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1--9.Google ScholarGoogle ScholarCross RefCross Ref
  34. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818--2826.Google ScholarGoogle ScholarCross RefCross Ref
  35. Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730 (2018).Google ScholarGoogle Scholar
  36. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 5998--6008.Google ScholarGoogle Scholar
  37. Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, et al. 2019. Machine learning at Facebook: Understanding inference at the edge. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’19). IEEE, 331--344.Google ScholarGoogle ScholarCross RefCross Ref
  38. Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1492--1500.Google ScholarGoogle ScholarCross RefCross Ref
  39. Ahmad Yasin. 2014. A top-down method for performance analysis and counters architecture. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’14). IEEE, 35--44.Google ScholarGoogle ScholarCross RefCross Ref

Published in

ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 1 (March 2021), 402 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3446348

Copyright © 2020 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 May 2020
• Revised: 1 September 2020
• Accepted: 1 October 2020
• Published: 30 December 2020
