ABSTRACT
Python is a popular, dynamic language for data science and scientific computing. To ensure efficiency, significant numerical libraries are implemented in static native languages. However, performance suffers when switching between native and non-native code, especially if data has to be converted between native arrays and Python data structures. As GPU accelerators are increasingly used, this problem becomes particularly acute, because data and control have to be repeatedly transferred between the accelerator and the host.
In this paper, we present DelayRepay, a delayed execution framework for numeric Python programs. It avoids excessive switching and data transfer by using lazy evaluation and kernel fusion. Using DelayRepay, operations on NumPy arrays are executed lazily, allowing multiple calls to accelerator kernels to be fused together dynamically. DelayRepay is available as a drop-in replacement for existing Python libraries. This approach enables significant performance improvement over the state of the art and is invisible to the application programmer. We show that our approach provides a maximum speedup of 377× over NumPy, a 409% increase over the state of the art.
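The idea of lazy evaluation followed by fusion can be illustrated with a minimal sketch. It is not DelayRepay's implementation or API: the names `Lazy`, `Leaf`, `wrap`, `fuse`, and `evaluate` are hypothetical, only `+`, `-`, and `*` are supported, and a generated Python lambda stands in for the single accelerator kernel that a real system would compile. Arithmetic on the wrapped arrays only builds an expression tree. When a result is requested, the whole tree is turned into one function and applied in a single pass, so no intermediate arrays are created.

```python
import numpy as np


class Lazy:
    """Node in a delayed-expression tree (hypothetical, for illustration only)."""

    def __init__(self, op, args):
        self.op = op
        self.args = args

    # Operators record the operation instead of computing it.
    def __add__(self, other):
        return Lazy("+", (self, wrap(other)))

    def __sub__(self, other):
        return Lazy("-", (self, wrap(other)))

    def __mul__(self, other):
        return Lazy("*", (self, wrap(other)))


class Leaf(Lazy):
    """Leaf node holding concrete data."""

    def __init__(self, value):
        super().__init__("leaf", ())
        self.value = np.asarray(value, dtype=float)


def wrap(x):
    """Turn plain scalars or arrays into leaves; pass lazy nodes through."""
    return x if isinstance(x, Lazy) else Leaf(x)


def fuse(node, inputs):
    """Build one expression string for the tree, collecting leaf data in `inputs`."""
    if isinstance(node, Leaf):
        inputs.append(node.value)
        return f"x{len(inputs) - 1}"
    left = fuse(node.args[0], inputs)
    right = fuse(node.args[1], inputs)
    return f"({left} {node.op} {right})"


def evaluate(node):
    """Generate a single fused function and apply it once per output element."""
    inputs = []
    body = fuse(node, inputs)
    names = ", ".join(f"x{i}" for i in range(len(inputs)))
    source = f"lambda {names}: {body}"
    kernel = eval(source)  # assumption: stands in for compiling one GPU kernel
    arrays = np.broadcast_arrays(*inputs)
    out = np.empty(arrays[0].shape, dtype=float)
    for idx in np.ndindex(out.shape):  # one pass, no temporary arrays
        out[idx] = kernel(*(a[idx] for a in arrays))
    return out, source


a = Leaf([1.0, 2.0, 3.0])
b = Leaf([4.0, 5.0, 6.0])
expr = a * b + a * 2.0          # nothing is computed here
result, source = evaluate(expr)  # the whole expression runs as one function
print(source)
print(result)
```

In eager NumPy the same expression would run three separate array operations and allocate two temporaries. The sketch shows only the structural idea; the per-element Python loop is slow and is there purely to make the single-pass behaviour visible.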
Index Terms
- DelayRepay: delayed execution for kernel fusion in Python