Abstract
Most compilers have a single core intermediate representation (IR), such as LLVM IR, sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level by level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific dialects based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM’s extensible MLIR compiler infrastructure suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure.
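As a rough illustration of the level-by-level lowering described in the abstract, the following Python sketch models a two-level hierarchy: a toy 1-D "stencil" level, where a domain-specific rewrite is easy to express, and a "loop" level with explicit bounds and flat instructions. It is not MLIR code and does not reproduce the paper's dialects. All names (`Access`, `Add`, `Apply`, `Loop`, `inline_stencils`, `lower_to_loops`, `run`) are hypothetical, and stencil inlining is chosen here only as an example of a transformation that is simple at the high level; the abstract does not say which two optimizations the authors implemented.

```python
from dataclasses import dataclass

# ---- Level 1: toy 1-D "stencil" dialect (hypothetical) ----
@dataclass(frozen=True)
class Access:            # read field `name` at a relative offset
    name: str
    offset: int

@dataclass(frozen=True)
class Add:               # sum of two sub-expressions
    lhs: object
    rhs: object

@dataclass(frozen=True)
class Apply:             # out[i] = expr evaluated at every interior i
    out: str
    expr: object

def shift(expr, delta):
    """Move every access in `expr` by `delta` grid points."""
    if isinstance(expr, Access):
        return Access(expr.name, expr.offset + delta)
    return Add(shift(expr.lhs, delta), shift(expr.rhs, delta))

def substitute(expr, producer):
    """Replace reads of the producer's output by its shifted expression."""
    if isinstance(expr, Access):
        if expr.name == producer.out:
            return shift(producer.expr, expr.offset)
        return expr
    return Add(substitute(expr.lhs, producer), substitute(expr.rhs, producer))

def inline_stencils(program):
    """Stencil-level rewrite: fold all producers into the final Apply.
    Assumes only the last output is live after the program."""
    *producers, result = program
    expr = result.expr
    for producer in reversed(producers):
        expr = substitute(expr, producer)
    return [Apply(result.out, expr)]

# ---- Level 2: toy "loop" dialect: explicit bounds + postfix code ----
@dataclass(frozen=True)
class Loop:
    out: str
    lo: int              # points skipped at the left boundary
    hi: int              # points skipped at the right boundary
    body: tuple          # postfix instructions

def flatten(expr):
    if isinstance(expr, Access):
        return (("load", expr.name, expr.offset),)
    return flatten(expr.lhs) + flatten(expr.rhs) + (("add",),)

def offsets(expr):
    if isinstance(expr, Access):
        return [expr.offset]
    return offsets(expr.lhs) + offsets(expr.rhs)

def lower_to_loops(program):
    """Lower each Apply to one Loop; the access pattern is now implicit."""
    loops = []
    for op in program:
        offs = offsets(op.expr)
        loops.append(Loop(op.out, max(0, -min(offs)), max(0, max(offs)),
                          flatten(op.expr)))
    return loops

def run(loops, fields, n):
    """Interpret the loop-level program on lists of length n."""
    fields = dict(fields)
    for loop in loops:
        out = [0] * n
        for i in range(loop.lo, n - loop.hi):
            stack = []
            for ins in loop.body:
                if ins[0] == "load":
                    stack.append(fields[ins[1]][i + ins[2]])
                else:  # "add"
                    b, a = stack.pop(), stack.pop()
                    stack.append(a + b)
            out[i] = stack.pop()
        fields[loop.out] = out
    return fields

# tmp[i] = a[i-1] + a[i+1];  out[i] = tmp[i-1] + tmp[i+1]
program = [
    Apply("tmp", Add(Access("a", -1), Access("a", 1))),
    Apply("out", Add(Access("tmp", -1), Access("tmp", 1))),
]
fused = inline_stencils(program)          # rewrite at the stencil level
inputs = {"a": list(range(8))}
reference = run(lower_to_loops(program), inputs, 8)
optimized = run(lower_to_loops(fused), inputs, 8)
print(len(lower_to_loops(program)), len(lower_to_loops(fused)))
print(reference["out"][2:6], optimized["out"][2:6])
```

In this sketch the inlining rewrite only needs to shift offsets in an expression tree because the stencil level still records which field is read at which offset. After lowering, the same information would have to be recovered from loop bounds and flat load instructions, which is the kind of analysis the abstract argues a single low-level IR forces on the compiler. The two results agree only on the interior points where both versions have all inputs in bounds, so the comparison is restricted to that range.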
Index Terms
- Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation