Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation

Published: 03 September 2021

Abstract

Most compilers have a single core intermediate representation (IR) (e.g., LLVM) sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific dialects based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM’s extensible MLIR compiler infrastructure suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure.
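The idea of rewriting at the most suitable level can be illustrated with a small toy. The sketch below is not the Open Earth Compiler and does not use the MLIR API; the `Scale`, `Laplace`, and `Loop` classes and the `fuse_scales`, `lower`, and `run` functions are hypothetical names invented for this illustration. It assumes 1D fields and that an intermediate field removed by fusion is not read anywhere else.

```python
from dataclasses import dataclass

# Hypothetical high-level "stencil" level: each op reads one field, writes another.
@dataclass(frozen=True)
class Scale:                 # dst[i] = factor * src[i]
    src: str
    dst: str
    factor: float

@dataclass(frozen=True)
class Laplace:               # dst[i] = src[i-1] - 2*src[i] + src[i+1] (interior only)
    src: str
    dst: str

# Hypothetical low-level "loop" level: an explicit index range and a weighted sum.
@dataclass(frozen=True)
class Loop:
    lo: int
    hi: int
    dst: str
    terms: tuple             # ((coefficient, source field, offset), ...)

def fuse_scales(ops):
    """High-level rewrite: merge Scale(a -> b) directly followed by Scale(b -> c).

    Assumes the intermediate field b is a temporary that nothing else reads.
    """
    out = []
    for op in ops:
        prev = out[-1] if out else None
        if isinstance(op, Scale) and isinstance(prev, Scale) and prev.dst == op.src:
            out[-1] = Scale(prev.src, op.dst, prev.factor * op.factor)
        else:
            out.append(op)
    return out

def lower(ops, n):
    """Lower every stencil-level op to one loop-level op over n grid points."""
    loops = []
    for op in ops:
        if isinstance(op, Scale):
            loops.append(Loop(0, n, op.dst, ((op.factor, op.src, 0),)))
        elif isinstance(op, Laplace):
            terms = ((1.0, op.src, -1), (-2.0, op.src, 0), (1.0, op.src, 1))
            loops.append(Loop(1, n - 1, op.dst, terms))
    return loops

def run(loops, fields, n):
    """Interpret the loop level on plain Python lists."""
    for loop in loops:
        out = fields.setdefault(loop.dst, [0.0] * n)
        for i in range(loop.lo, loop.hi):
            out[i] = sum(c * fields[f][i + off] for c, f, off in loop.terms)
    return fields

program = [Scale("u", "t1", 2.0), Scale("t1", "t2", 0.5), Laplace("t2", "out")]
fused = fuse_scales(program)          # two ops remain after the high-level rewrite
u = [0.0, 1.0, 4.0, 9.0, 16.0]
result = run(lower(fused, 5), {"u": list(u)}, 5)["out"]
reference = run(lower(program, 5), {"u": list(u)}, 5)["out"]
```

The fusion is a short pattern match at the stencil level because the producer and consumer fields are named explicitly there. After lowering, the same information would have to be recovered from loop bounds and index expressions, which is the kind of analysis the abstract describes as complex.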

Published in

ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 4
December 2021
497 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3476575

Copyright © 2021 Owner/Author

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 3 September 2021
• Accepted: 1 May 2021
• Revised: 1 April 2021
• Received: 1 December 2020

Qualifiers

• research-article
• Research
• Refereed
