research-article

DelayRepay: delayed execution for kernel fusion in Python

Authors:
John Magnus Morton (University of Edinburgh, UK; ORCID 0000-0002-2974-2767)
Kuba Kaszyk (University of Edinburgh, UK)
Lu Li (University of Edinburgh, UK)
Jiawen Sun (University of Edinburgh, UK)
Christophe Dubach (McGill University, Canada)
Michel Steuwer (University of Edinburgh, UK)
Murray Cole (University of Edinburgh, UK)
Michael F. P. O'Boyle (University of Edinburgh, UK; ORCID 0000-0003-1619-5052)

DLS 2020: Proceedings of the 16th ACM SIGPLAN International Symposium on Dynamic Languages, November 2020, Pages 43–56
https://doi.org/10.1145/3426422.3426980

Published: 15 November 2020

ABSTRACT

Python is a popular, dynamic language for data science and scientific computing. To ensure efficiency, significant numerical libraries are implemented in static native languages. However, performance suffers when switching between native and non-native code, especially if data has to be converted between native arrays and Python data structures. As GPU accelerators are increasingly used, this problem becomes particularly acute: data and control have to be repeatedly transferred between the accelerator and the host.

In this paper, we present DelayRepay, a delayed execution framework for numeric Python programs. It avoids excessive switching and data transfer by using lazy evaluation and kernel fusion. Using DelayRepay, operations on NumPy arrays are executed lazily, allowing multiple calls to accelerator kernels to be fused together dynamically. DelayRepay is available as a drop-in replacement for existing Python libraries. This approach enables significant performance improvement over the state of the art and is invisible to the application programmer. We show that our approach provides a maximum speedup of 377× over NumPy, a 409% increase over the state of the art.
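The combination of lazy evaluation and kernel fusion described above can be illustrated with a minimal sketch. This is not DelayRepay's implementation or API: the class `Lazy` and its methods `leaf`, `kernel_source`, and `evaluate` are hypothetical names chosen for illustration, and the sketch generates a single pure-Python loop on the CPU where a real system would presumably generate one accelerator kernel. It only shows the general idea that arithmetic on arrays can build an expression graph first and be compiled into one fused pass later.

```python
class Lazy:
    """Node in a deferred elementwise expression graph (hypothetical sketch)."""

    def __init__(self, op, args):
        self.op = op      # "leaf", "const", or an operator symbol such as "+"
        self.args = args  # leaf: (data,); const: (value,); operator: (lhs, rhs)

    @classmethod
    def leaf(cls, data):
        # Wrap concrete input data; no computation happens here.
        return cls("leaf", (list(data),))

    def _binary(self, symbol, other):
        # Record the operation instead of executing it.
        if not isinstance(other, Lazy):
            other = Lazy("const", (other,))
        return Lazy(symbol, (self, other))

    def __add__(self, other):
        return self._binary("+", other)

    def __sub__(self, other):
        return self._binary("-", other)

    def __mul__(self, other):
        return self._binary("*", other)

    def _emit(self, buffers):
        # Produce a scalar expression string and collect the input buffers.
        if self.op == "leaf":
            buffers.append(self.args[0])
            return f"in{len(buffers) - 1}[i]"
        if self.op == "const":
            return repr(self.args[0])
        left = self.args[0]._emit(buffers)
        right = self.args[1]._emit(buffers)
        return f"({left} {self.op} {right})"

    def kernel_source(self):
        # Fuse the whole graph into the source of one function with one loop.
        buffers = []
        body = self._emit(buffers)
        params = ", ".join(f"in{k}" for k in range(len(buffers)))
        source = (
            f"def fused(n, {params}):\n"
            f"    return [{body} for i in range(n)]\n"
        )
        return source, buffers

    def evaluate(self):
        # Compile and run the fused function only when a result is needed.
        source, buffers = self.kernel_source()
        namespace = {}
        exec(source, namespace)
        return namespace["fused"](len(buffers[0]), *buffers)


a = Lazy.leaf([1.0, 2.0, 3.0])
b = Lazy.leaf([10.0, 20.0, 30.0])
expr = a * b + 2.0                # builds a graph; nothing is computed yet
source, _ = expr.kernel_source()  # one function containing a single loop
result = expr.evaluate()          # [12.0, 42.0, 92.0]
```

An eager library would run the multiplication and the addition as two separate passes with an intermediate array between them. Deferring execution lets both operations be emitted as one loop body, which is the property that avoids repeated host/accelerator round trips in the setting the abstract describes.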

Published in

DLS 2020: Proceedings of the 16th ACM SIGPLAN International Symposium on Dynamic Languages
November 2020
125 pages
ISBN: 9781450381758
DOI: 10.1145/3426422

Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPU
code fusion
delayed evaluation
dynamic compilation
Acceptance Rates
Overall Acceptance Rate: 32 of 77 submissions, 42%