Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation

Authors:
Tobias Gysi

ETH Zurich, Switzerland

ETH Zurich, Switzerland
View Profile

,
Christoph Müller

ETH Zurich, Switzerland

ETH Zurich, Switzerland
View Profile

,
Oleksandr Zinenko

Google, France

Google, France

0000-0003-1978-0222
View Profile

,
Stephan Herhut

Google, Germany

Google, Germany
View Profile

,
Eddie Davis

Vulcan Inc, USA

Vulcan Inc, USA
View Profile

,
Tobias Wicky

Vulcan Inc, USA

Vulcan Inc, USA
View Profile

,
Oliver Fuhrer

Vulcan Inc, USA

Vulcan Inc, USA
View Profile

,
Torsten Hoefler

ETH Zurich, Switzerland

ETH Zurich, Switzerland
View Profile

,
Tobias Grosser

University of Edinburgh, UK

University of Edinburgh, UK
View Profile

ACM Transactions on Architecture and Code Optimization Volume 18 Issue 4Article No.: 51pp 1–23https://doi.org/10.1145/3469030

Published:03 September 2021Publication History

ACM Transactions on Architecture and Code Optimization

Abstract

Most compilers have a single core intermediate representation (IR) (e.g., LLVM) sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific dialects based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM’s extensible MLIR compiler infrastructure suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure.

References

2020. CLIMA. Retrieved from https://github.com/climate-machine/CLIMA/.Google Scholar
2020. Consortium for Small-scale Modeling. Retrieved from http://www.cosmo-model.org/.Google Scholar
2020. FV3: Finite-Volume Cubed-Sphere Dynamical Core. Retrieved from https://www.gfdl.noaa.gov/fv3/.Google Scholar
2020. GridTools. Retrieved from https://github.com/GridTools/gridtools.Google Scholar
2020. GT4Py. Retrieved from https://github.com/gridtools/gt4py.Google Scholar
2020. RAJA. Retrieved from https://github.com/LLNL/RAJA.Google Scholar
Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley. 2019. Learning to optimize halide with tree search and random programs. ACM Trans. Graph. 38, 4 (July 2019). DOI:https://doi.org/10.1145/3306346.3322967Google ScholarDigital Library
S. V. Adams, R. W. Ford, M. Hambley, J. M. Hobson, I. Kavc̆ic̆, C. M. Maynard, T. Melvin, E. H. Müller, S. Mullerworth, A. R. Porter, M. Rezny, B. J. Shipway, and R. Wong. 2019. LFRic: Meeting the challenges of scalability and performance portability in weather and climate models. J. Parallel Distrib. Comput. 132 (2019), 383–396. DOI:https://doi.org/10.1016/j.jpdc.2019.02.007Google Scholar
R. Baghdadi, U. Beaugnon, A. Cohen, T. Grosser, M. Kruse, C. Reddy, S. Verdoolaege, A. Betts, A. F. Donaldson, J. Ketema, J. Absar, S. v. Haastregt, A. Kravets, A. Lokhmotov, R. David, and E. Hajiyev. 2015. PENCIL: A platform-neutral compute intermediate language for accelerator programming. In Proceedings of the International Conference on Parallel Architecture and Compilation (PACT’15). 138–149.Google Scholar
Michael Baldauf, Axel Seifert, Jochen Förstner, Detlev Majewski, Matthias Raschendorfer, and Thorsten Reinhardt. 2011. Operational convective-scale numerical weather prediction with the COSMO model: Description and sensitivities. Month. Weath. Rev. 139, 12 (2011), 3887–3905.Google ScholarCross Ref
Ulysse Beaugnon, Antoine Pouille, Marc Pouzet, Jacques Pienaar, and Albert Cohen. 2017. Optimization space pruning without regrets. In Proceedings of the 26th International Conference on Compiler Construction. 34–44.Google ScholarDigital Library
Tal Ben-Nun, Johannes de Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, and Torsten Hoefler. 2019. Stateful dataflow multigraphs: A data-centric model for performance portability on heterogeneous architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’19).Google ScholarDigital Library
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18). USENIX Association, 578–594. Retrieved from https://www.usenix.org/conference/osdi18/presentation/chen.Google Scholar
Valentin Clement, Sylvaine Ferrachat, Oliver Fuhrer, Xavier Lapillonne, Carlos E. Osuna, Robert Pincus, Jon Rood, and William Sawyer. 2018. The CLAW DSL: Abstractions for performance portable weather and climate models. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC’18). Association for Computing Machinery, New York, NY. DOI:https://doi.org/10.1145/3218176.3218226Google ScholarDigital Library
Zachary DeVito, James Hegarty, Alex Aiken, Pat Hanrahan, and Jan Vitek. 2013. Terra: A multi-stage language for high-performance computing. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’13). Association for Computing Machinery, New York, NY, 105–116. DOI:https://doi.org/10.1145/2491956.2462166Google ScholarDigital Library
H. C. Edwards and C. R. Trott. 2013. Kokkos: Enabling performance portability across manycore architectures. In Proceedings of the Extreme Scaling Workshop (XSW’13). 18–24.Google Scholar
Oliver Fuhrer, Carlos Osuna, Xavier Lapillonne, Tobias Gysi, Ben Cumming, Mauro Bianco, Andrea Arteaga, and Thomas Schulthess. 2014. Towards a performance portable, architecture agnostic implementation strategy for weather and climate models. Supercomput. Front. Innov. 1, 1 (2014).Google Scholar
Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. 2014. Hybrid hexagonal/classical tiling for GPUs. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO’14). Association for Computing Machinery, New York, NY, 66–75. DOI:https://doi.org/10.1145/2544137.2544160Google ScholarDigital Library
Tobias Grosser, Albert Cohen, Paul H. J. Kelly, J. Ramanujam, P. Sadayappan, and Sven Verdoolaege. 2013. Split tiling for GPUs: Automatic parallelization using trapezoidal tiles. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU’13). Association for Computing Machinery, New York, NY, 24–31. DOI:https://doi.org/10.1145/2458523.2458526Google ScholarDigital Library
Tobias Grosser, Armin Groesslinger, and Christian Lengauer. 2012. Polly—Performing polyhedral optimizations on a low-level intermediate representation. Parallel Process. Lett. 22, 04 (2012), 1250010.Google ScholarCross Ref
Tobias Grosser and Torsten Hoefler. 2016. Polly-ACC transparent compilation to heterogeneous hardware. In Proceedings of the International Conference on Supercomputing (ICS’16). Association for Computing Machinery, New York, NY. DOI:https://doi.org/10.1145/2925426.2926286Google ScholarDigital Library
Tobias Gysi, Tobias Grosser, and Torsten Hoefler. 2015. MODESTO: Data-centric analytic optimization of complex stencil programs on heterogeneous architectures. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS’15). Association for Computing Machinery, New York, NY, 177–186. DOI:https://doi.org/10.1145/2751205.2751223Google ScholarDigital Library
T. Gysi, T. Grosser, and T. Hoefler. 2019. Absinthe: Learning an analytical performance model to fuse and tile stencil codes in one shot. In Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques (PACT’19). 370–382.Google Scholar
Tobias Gysi, Carlos Osuna, Oliver Fuhrer, Mauro Bianco, and Thomas C. Schulthess. 2015. STELLA: A domain-specific tool for structured grid methods in weather and climate models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15). Association for Computing Machinery, New York, NY. DOI:https://doi.org/10.1145/2807591.2807627Google Scholar
Bastian Hagedorn, Larisa Stoltzfus, Michel Steuwer, Sergei Gorlatch, and Christophe Dubach. 2018. High performance stencil code generation with lift. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’18). Association for Computing Machinery, New York, NY, 100–112. DOI:https://doi.org/10.1145/3168824Google Scholar
Lucas M. Harris and Shian-Jiann Lin. 2013. A two-way nested global-regional dynamical core on the cubed-sphere grid. Month. Weath. Rev. 141, 1 (2013), 283–306. DOI:https://doi.org/10.1175/MWR-D-11-00201.1Google ScholarCross Ref
Justin Holewinski, Louis-Noël Pouchet, and P. Sadayappan. 2012. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS’12). Association for Computing Machinery, New York, NY, 311–320. DOI:https://doi.org/10.1145/2304576.2304619Google Scholar
M. Kruse and H. Finkel. 2018. User-directed loop-transformations in Clang. In Proceedings of the IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC’18). 49–58. DOI:https://doi.org/10.1109/LLVM-HPC.2018.8639402Google ScholarCross Ref
Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’04). IEEE, 75–86.Google ScholarDigital Library
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling compiler infrastructure for domain specific computation. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO’21). 2–14. DOI:https://doi.org/10.1109/CGO51591.2021.9370308Google ScholarDigital Library
Chris Leary and Todd Wang. 2017. XLA: TensorFlow, compiled. In Proceedings of the TensorFlow Dev Summit.Google Scholar
Roland Leißa, Klaas Boesche, Sebastian Hack, Arsène Pérard-Gayot, Richard Membarth, Philipp Slusallek, André Müller, and Bertil Schmidt. 2018. AnyDSL: A partial evaluation framework for programming high-performance libraries. Proc. ACM Program. Lang. 2, OOPSLA (Oct. 2018). DOI:https://doi.org/10.1145/3276489Google Scholar
Naoya Maruyama and Takayuki Aoki. 2014. Optimizing stencil computations for NVIDIA Kepler GPUs. In Proceedings of the 1st International Workshop on High-performance Stencil Computations. 89–95.Google Scholar
Kazuaki Matsumura, Hamid Reza Zohouri, Mohamed Wahib, Toshio Endo, and Satoshi Matsuoka. 2020. AN5D: Automated stencil framework for high-degree temporal blocking on GPUs. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO’20). Association for Computing Machinery, New York, NY, 199–211. DOI:https://doi.org/10.1145/3368826.3377904Google ScholarDigital Library
William M. McKeeman. 1965. Peephole optimization. Commun. ACM 8, 7 (1965), 443–444.Google ScholarDigital Library
G. A. McMechan. 1983. Migration by extrapolation of time-dependent boundary VALUES. Geophys. Prospect. 31 (June 1983), 413–420. DOI:https://doi.org/10.1111/j.1365-2478.1983.tb01060.xGoogle Scholar
Steven S. Muchnick. 1998. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA.Google ScholarDigital Library
Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic optimization for image processing pipelines. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15). Association for Computing Machinery, New York, NY, 429–443. DOI:https://doi.org/10.1145/2694344.2694364Google ScholarDigital Library
Michel Müller and Takayuki Aoki. 2018. Hybrid Fortran: High productivity GPU porting framework applied to Japanese weather prediction model. In Accelerator Programming Using Directives, Sunita Chandrasekaran and Guido Juckeland (Eds.). Springer International Publishing, Cham, 20–41.Google Scholar
A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 2010. 3.5-D Blocking optimization for stencil computations on modern CPUs and GPUs. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. 1–13.Google Scholar
John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. 2008. Scalable parallel programming with CUDA. Queue 6, 2 (Mar. 2008), 40–53. DOI:https://doi.org/10.1145/1365490.1365500Google ScholarDigital Library
Carlos Osuna, Tobias Wicky, Fabian Thuering, Torsten Hoefler, and Oliver Fuhrer. 2020. Dawn: A high-level domain-specific language compiler toolchain for weather and climate applications. Supercomput. Front. Innov. 7, 2 (2020).Google Scholar
Dan Quinlan. 2000. ROSE: Compiler support for object-oriented frameworks. Parallel Process. Lett. 10, 02n03 (2000), 215–226.Google ScholarCross Ref
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’13). Association for Computing Machinery, New York, NY, 519–530. DOI:https://doi.org/10.1145/2491956.2462176Google ScholarDigital Library
Prashant Rawat, Martin Kong, Tom Henretty, Justin Holewinski, Kevin Stock, Louis-Noël Pouchet, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2015. SDSLc: A multi-target domain-specific compiler for stencil computations. In Proceedings of the 5th International Workshop on Domain-specific Languages and High-level Frameworks for High-performance Computing (WOLFHPC’15). Association for Computing Machinery, New York, NY. DOI:https://doi.org/10.1145/2830018.2830025Google Scholar
Prashant Singh Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, Louis-Noël Pouchet, and P. Sadayappan. 2016. Effective resource management for enhancing performance of 2D and 3D stencils on GPUs. In Proceedings of the 9th Workshop on General Purpose Processing Using Graphics Processing Unit (GPGPU’16). Association for Computing Machinery, New York, NY, 92–102. DOI:https://doi.org/10.1145/2884045.2884047Google Scholar
Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan. 2018. Register optimizations for stencils on GPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’18). Association for Computing Machinery, New York, NY, USA, 168–182. DOI:https://doi.org/10.1145/3178487.3178500Google Scholar
P. S. Rawat, M. Vaidya, A. Sukumaran-Rajam, M. Ravishankar, V. Grover, A. Rountev, L. Pouchet, and P. Sadayappan. 2018. Domain-specific optimization and generation of high-performance GPU code for stencil computations. Proc. IEEE 106, 11 (2018), 1902–1920.Google ScholarCross Ref
Tiark Rompf and Martin Odersky. 2010. Lightweight modular staging: A pragmatic approach to runtime code generation and compiled DSLs. In Proceedings of the 9th International Conference on Generative Programming and Component Engineering (GPCE’10). Association for Computing Machinery, New York, NY, 127–136. DOI:https://doi.org/10.1145/1868294.1868314Google ScholarDigital Library
Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. 1988. Global value numbers and redundant computations. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 12–27.Google Scholar
Mohammed Sourouri, Scott B. Baden, and Xing Cai. 2017. Panda: A compiler framework for concurrent CPU+GPU execution of 3D stencil computations on GPU-accelerated supercomputers. Int. J. Parallel Program. 45, 3 (June 2017), 711–729. DOI:https://doi.org/10.1007/s10766-016-0454-1Google ScholarDigital Library
M. Steuwer, T. Remmelg, and C. Dubach. 2017. LIFT: A functional data-parallel IR for high-performance GPU code generation. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO’17). 74–85.Google Scholar
Arvind K. Sujeeth, Kevin J. Brown, Hyoukjoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. 2014. Delite: A compiler architecture for performance-oriented embedded domain-specific languages. ACM Trans. Embed. Comput. Syst. 13, 4s (April 2014). DOI:https://doi.org/10.1145/2584665Google ScholarDigital Library
Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir stencil compiler. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11). Association for Computing Machinery, New York, NY, 117–128. DOI:https://doi.org/10.1145/1989493.1989508Google Scholar
Nicolas Vasilache, Cédric Bastoul, Albert Cohen, and Sylvain Girbal. 2006. Violated dependence analysis. In Proceedings of the 20th International Conference on Supercomputing. 335–344.Google ScholarDigital Library
Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim. 9, 4 (Jan. 2013). DOI:https://doi.org/10.1145/2400682.2400713Google ScholarDigital Library
M. Wahib and N. Maruyama. 2014. Scalable kernel fusion for memory-bound GPU applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 191–202.Google Scholar
C. Yount, J. Tobin, A. Breuer, and A. Duran. 2016. YASK—Yet Another Stencil Kernel: A framework for HPC stencil code-generation and tuning. In Proceedings of the 6th International Workshop on Domain-specific Languages and High-level Frameworks for High-performance Computing (WOLFHPC’16). 30–39.Google Scholar
Tuowen Zhao, Protonu Basu, Samuel Williams, Mary Hall, and Hans Johansen. 2019. Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs. In Proceedings of the International Conference for High-performance Computing, Networking, Storage and Analysis (SC’19). Association for Computing Machinery, New York, NY. DOI:https://doi.org/10.1145/3295500.3356210Google ScholarDigital Library
Oleksandr Zinenko, Sven Verdoolaege, Chandan Reddy, Jun Shirako, Tobias Grosser, Vivek Sarkar, and Albert Cohen. 2018. Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling. In Proceedings of the 27th International Conference on Compiler Construction (CC’18). Association for Computing Machinery, New York, NY, 3–13. DOI:https://doi.org/10.1145/3178372.3179507Google ScholarDigital Library

Index Terms

Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation
1. Applied computing
  1. Physical sciences and engineering
    1. Earth and atmospheric sciences
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. Context specific languages
      1. Domain specific languages

Recommendations

A shared compilation stack for distributed-memory parallelism in stencil DSLs
ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3

Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target ...
Read More
Macros for domain-specific languages

Macros provide a powerful means of extending languages. They have proven useful in both general-purpose and domain-specific programming contexts. This paper presents an architecture for implementing macro-extensible DSLs on top of macro-extensible host ...
Read More
Declaratively defining domain-specific language debuggers
GCPE '11

Tool support is vital to the effectiveness of domain-specific languages. With language workbenches, domain-specific languages and their tool support can be generated from a combined, high-level specification. This paper shows how such a specification ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Architecture and Code Optimization Volume 18, Issue 4
December 2021
497 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3476575
Editor:
David Kaeli
Northeastern University, USA
Issue’s Table of Contents
Copyright © 2021 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 3 September 2021
- Accepted: 1 May 2021
- Revised: 1 April 2021
- Received: 1 December 2020
Published in taco Volume 18, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Weather and climate
intermediate representations
stencil computations
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 18
  Total Citations
  View Citations
- 2,918
  Total Downloads
- Downloads (Last 12 months)1,774
- Downloads (Last 6 weeks)216
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format

Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation

ACM Transactions on Architecture and Code Optimization

Abstract

References

Cited By

Index Terms

Recommendations

A shared compilation stack for distributed-memory parallelism in stencil DSLs

Macros for domain-specific languages

Declaratively defining domain-specific language debuggers