DOI: 10.1145/3426422.3426980
research-article

DelayRepay: delayed execution for kernel fusion in Python

Published: 15 November 2020

ABSTRACT

Python is a popular, dynamic language for data science and scientific computing. To ensure efficiency, significant numerical libraries are implemented in static native languages. However, performance suffers when switching between native and non-native code, especially if data has to be converted between native arrays and Python data structures. As GPU accelerators are increasingly used, this problem becomes particularly acute: data and control have to be repeatedly transferred between the accelerator and the host.

In this paper, we present DelayRepay, a delayed execution framework for numeric Python programs. It avoids excessive switching and data transfer by using lazy evaluation and kernel fusion. Using DelayRepay, operations on NumPy arrays are executed lazily, allowing multiple calls to accelerator kernels to be fused together dynamically. DelayRepay is available as a drop-in replacement for existing Python libraries. This approach enables significant performance improvements over the state of the art and is invisible to the application programmer. We show that our approach provides a maximum speedup of 377× over NumPy, a 409% increase over the state of the art.
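The idea of delaying operations and then fusing them can be illustrated with a minimal sketch. This is not DelayRepay's implementation or API: the `Lazy` class and its methods are hypothetical names chosen for illustration, plain Python lists stand in for NumPy arrays, and a generated Python loop stands in for a generated accelerator kernel. Each arithmetic operation only records a node in an expression tree; the work happens when `evaluate` is called, at which point the whole tree is emitted as one loop with no intermediate arrays.

```python
class Lazy:
    """Hypothetical deferred array: records operations instead of running them."""

    def __init__(self, op, args):
        self.op = op      # "leaf", "const", "+" or "*"
        self.args = args  # child nodes, or the wrapped data for leaf/const

    @staticmethod
    def leaf(values):
        # Wrap concrete data (a list stands in for a NumPy array here).
        return Lazy("leaf", ([float(v) for v in values],))

    def _binary(self, op, other):
        # Build a tree node; nothing is computed at this point.
        if not isinstance(other, Lazy):
            other = Lazy("const", (float(other),))
        return Lazy(op, (self, other))

    def __add__(self, other):
        return self._binary("+", other)

    def __mul__(self, other):
        return self._binary("*", other)

    def _emit(self, leaves):
        # Turn the tree into a single per-element expression string.
        if self.op == "leaf":
            leaves.append(self.args[0])
            return f"a{len(leaves) - 1}[i]"
        if self.op == "const":
            return repr(self.args[0])
        left = self.args[0]._emit(leaves)
        right = self.args[1]._emit(leaves)
        return f"({left} {self.op} {right})"

    def kernel_source(self):
        # One loop for the whole expression: this is the "fused kernel".
        leaves = []
        body = self._emit(leaves)
        return f"for i in range(n):\n    out[i] = {body}\n", leaves

    def evaluate(self):
        # Assumes at least one leaf and that all leaves have equal length.
        source, leaves = self.kernel_source()
        n = len(leaves[0])
        out = [0.0] * n
        env = {f"a{k}": leaf for k, leaf in enumerate(leaves)}
        env.update(n=n, out=out)
        exec(source, env)  # stand-in for compiling and launching a kernel
        return out


x = Lazy.leaf([1.0, 2.0, 3.0])
y = Lazy.leaf([4.0, 5.0, 6.0])
delayed = x * y + 2.0          # builds a tree, computes nothing yet
source, _ = delayed.kernel_source()
result = delayed.evaluate()    # one fused pass over the data
```

An eager library would run `x * y` as one kernel, materialize a temporary array, and then run the addition as a second kernel. In the sketch, both operations end up in the single generated loop held in `source`, which is the effect that kernel fusion aims for.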

Supplemental Material

dls20main-p7-p-video.mp4 (mp4, 31 MB)

3426422.3426980.mp4 (mp4, 22.1 MB)


        • Published in

          DLS 2020: Proceedings of the 16th ACM SIGPLAN International Symposium on Dynamic Languages
          November 2020
          125 pages
          ISBN:9781450381758
          DOI:10.1145/3426422

          Copyright © 2020 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 32 of 77 submissions, 42%
