DOI: 10.1145/3366428.3380769 · GPGPU Conference Proceedings · Research article

High-level hardware feature extraction for GPU performance prediction of stencils

Published: 23 February 2020

ABSTRACT

High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as Lift, Futhark, or Delite have shown that it is possible to have both high-level abstractions and performance, even for HPC workloads such as stencils. In addition, these high-level functional abstractions can also be used to represent programs and their optimized variants within the compiler itself. However, such high-level approaches rely heavily on the compiler to optimize programs, which is notoriously hard when targeting GPUs.
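To illustrate the style of abstraction these frameworks use (a hypothetical sketch, not Lift's actual API), a 1D 3-point stencil can be written as a composition of a sliding-window operator and a map, with no explicit loop nest or index arithmetic:

```python
def slide(size, step, xs):
    """Return overlapping windows of length `size`, advancing by `step`."""
    return [xs[i:i + size] for i in range(0, len(xs) - size + 1, step)]

def jacobi3(window):
    """Weighted 3-point average over one neighborhood."""
    left, centre, right = window
    return 0.25 * left + 0.5 * centre + 0.25 * right

def stencil(xs):
    # Clamp-pad the boundary, then map the kernel over every neighborhood.
    padded = [xs[0]] + xs + [xs[-1]]
    return [jacobi3(w) for w in slide(3, 1, padded)]

print(stencil([1.0, 2.0, 3.0, 4.0]))
```

Because the whole computation is an expression tree of `slide` and `map`, a compiler can rewrite it (tiling, fusion, parallelization) without pattern-matching on loops.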

Compilers either use hand-crafted heuristics to direct the optimizations or iterative compilation to search the optimization space. The first approach has fast compile times; however, it is not performance-portable across different devices and requires a lot of human effort to build the heuristics. Iterative compilation, on the other hand, searches the optimization space automatically and adapts to different devices. However, this process is often very time-consuming, as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speed up the exploration of the optimization space. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features.
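The idea of model-guided exploration can be sketched in a few lines (an illustration of the general technique, not the paper's model): a cheap predictor ranks the candidate variants, and only the most promising few are actually run on the device. `predict_cost` and `measure` are hypothetical stand-ins for a statistical model and a hardware timing run.

```python
def explore(variants, predict_cost, measure, k=3):
    """Rank variants by a cheap static prediction, then measure only
    the top-k on real hardware instead of measuring all of them."""
    ranked = sorted(variants, key=predict_cost)
    return min(ranked[:k], key=measure)

# Toy usage: "variants" are tile sizes; the prediction is a rough proxy,
# the measurement is the (expensive) ground truth.
variants = [1, 2, 4, 8, 16, 32, 64]
predicted = lambda t: abs(t - 20)   # rough model: around 20 looks good
measured = lambda t: abs(t - 16)    # true optimum is 16
best = explore(variants, predicted, measured)
print(best)  # only k=3 variants are ever measured on hardware
```

The speedup over plain iterative compilation comes from the ratio of candidates measured: here 3 runs instead of 7, and in a real space a handful instead of thousands.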

Using the Lift framework, this paper demonstrates how low-level, GPU-specific features can be extracted directly from a high-level functional representation. The Lift IR (Intermediate Representation) is in fact a very suitable choice, since all optimization choices are exposed at the IR level. This paper shows how to extract low-level features such as the number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models. Using this approach, we are able to speed up the exploration of the space by a factor of 2000x on an AMD GPU and 450x on an Nvidia GPU, on average across many stencil applications.
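The cache-line feature mentioned above can be made concrete with a small sketch (an assumed formulation for illustration, not code from the paper): given the byte address each thread of a warp accesses, count the distinct cache lines touched. A coalesced access maps a whole warp onto one line; a strided access touches one line per thread.

```python
WARP_SIZE = 32
CACHE_LINE_BYTES = 128  # typical L1 cache-line size on Nvidia GPUs

def unique_cache_lines(addresses):
    """Number of distinct cache lines one warp-wide access touches."""
    assert len(addresses) == WARP_SIZE
    return len({addr // CACHE_LINE_BYTES for addr in addresses})

# Coalesced: 32 consecutive 4-byte floats fit in a single 128-byte line.
coalesced = [4 * tid for tid in range(WARP_SIZE)]
# Strided: each thread jumps 128 bytes, so every access is its own line.
strided = [128 * tid for tid in range(WARP_SIZE)]

print(unique_cache_lines(coalesced))  # 1
print(unique_cache_lines(strided))    # 32
```

A performance model fed this feature can distinguish the two access patterns statically, which is exactly the kind of information performance counters would otherwise only reveal after running the kernel.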

References

  1. Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman P. Amarasinghe. 2014. OpenTuner: an extensible framework for program autotuning. In PACT. ACM.
  2. Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS. IEEE.
  3. Ulysse Beaugnon, Antoine Pouille, Marc Pouzet, Jacques Pienaar, and Albert Cohen. 2017. Optimization Space Pruning Without Regrets. In CC. ACM.
  4. Kevin J. Brown, Arvind K. Sujeeth, Hyouk Joong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. 2011. A Heterogeneous Parallel Framework for Domain-Specific Languages. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT '11).
  5. Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578--594.
  6. Alexander Collins, Christian Fensch, Hugh Leather, and Murray Cole. 2013. MaSiF: Machine learning guided auto-tuning of parallel skeletons. In HiPC. IEEE.
  7. John Demme and Simha Sethumadhavan. 2012. Approximate graph clustering for program characterization. ACM TACO 8, 4 (2012), 21.
  8. Christophe Dubach, John Cavazos, Björn Franke, Grigori Fursin, Michael F. P. O'Boyle, and Olivier Temam. 2007. Fast compiler optimisation evaluation using code-feature based performance prediction. In CF. ACM.
  9. Bastian Hagedorn, Larisa Stoltzfus, Michel Steuwer, Sergei Gorlatch, and Christophe Dubach. 2018. High Performance Stencil Code Generation with Lift. In CGO. ACM, New York, NY, USA, 100--112.
  10. Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: Purely Functional GPU-programming with Nested Parallelism and In-place Array Updates. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017).
  11. Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, Prashant Singh Rawat, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, and P. Sadayappan. 2018. GPU Code Optimization Using Abstract Kernel Emulation and Sensitivity Analysis. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). ACM, New York, NY, USA, 736--751.
  12. Sunpyo Hong and Hyesoon Kim. 2009. An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness. In ISCA. ACM.
  13. Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi. 2012. Stargazer: Automated regression-based GPU design space exploration. In ISPASS, Rajeev Balasubramonian and Vijayalakshmi Srinivasan (Eds.). IEEE.
  14. Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi. 2013. Starchart: Hardware and software optimization using recursive partitioning regression trees. In PACT. IEEE.
  15. Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. 2010. Modeling GPU-CPU Workloads and Systems. In GPGPU. ACM.
  16. Yooseong Kim and Aviral Shrivastava. 2011. CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA. In DAC. ACM, 6.
  17. Hugh Leather, Edwin V. Bonilla, and Michael F. P. O'Boyle. 2009. Automatic Feature Generation for Machine Learning Based Optimizing Compilation. In CGO. IEEE.
  18. Seyong Lee, Jeremy S. Meredith, and Jeffrey S. Vetter. 2015. COMPASS: A Framework for Automated Performance Modeling and Prediction. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, 10.
  19. Roland Leissa, Klaas Boesche, Sebastian Hack, Arsène Pérard-Gayot, Richard Membarth, Philipp Slusallek, André Müller, and Bertil Schmidt. 2018. AnyDSL: A Partial Evaluation Framework for Programming High-performance Libraries. Proc. ACM Program. Lang. 2, OOPSLA, Article 119 (Oct. 2018), 30 pages.
  20. Souley Madougou, Ana Lucia Varbanescu, and Cees de Laat. 2016. Using Colored Petri Nets for GPGPU Performance Modeling. In CF. ACM.
  21. Alberto Magni, Christophe Dubach, and Michael F. P. O'Boyle. 2014. Automatic optimization of thread-coarsening for graphics processors. In PACT. ACM.
  22. Trevor L. McDonell, Manuel M. T. Chakravarty, Gabriele Keller, and Ben Lippmeier. 2013. Optimising Purely Functional GPU Programs. In ICFP '13: The 18th ACM SIGPLAN International Conference on Functional Programming. ACM.
  23. Jiayuan Meng, Vitali A. Morozov, Kalyan Kumaran, Venkatram Vishwanath, and Thomas D. Uram. 2011. GROPHECY: GPU performance projection from CPU code skeletons. In SC. ACM.
  24. Mircea Namolaru, Albert Cohen, Grigori Fursin, Ayal Zaks, and Ari Freund. 2010. Practical Aggregation of Semantical Program Properties for Machine Learning Based Optimization. In CASES. ACM.
  25. Cedric Nugteren and Valeriu Codreanu. 2015. CLTune: A Generic Auto-Tuner for OpenCL Kernels. In MCSoC. IEEE.
  26. Cedric Nugteren and Henk Corporaal. 2012. The boat hull model: enabling performance prediction for parallel computing prior to code development. In CF. ACM.
  27. Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, and Henri E. Bal. 2014. A detailed GPU cache model based on reuse distance theory. In HPCA. IEEE.
  28. Eunjung Park, John Cavazos, and Marco A. Alvarez. 2012. Using graph-based program characterization for predictive modeling. In CGO. ACM.
  29. Eunjung Park, Louis-Noël Pouchet, John Cavazos, Albert Cohen, and P. Sadayappan. 2011. Predictive modeling in a polyhedral optimization space. In CGO. IEEE.
  30. Simon Peyton Jones, Andrew Tolmach, and Tony Hoare. 2001. Playing by the rules: rewriting as a practical optimisation technique in GHC. ACM SIGPLAN.
  31. Ari Rasch, Michael Haidl, and Sergei Gorlatch. 2017. ATF: A Generic AutoTuning Framework. In 19th IEEE International Conference on High Performance Computing and Communications; 15th IEEE International Conference on Smart City; 3rd IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2017, Bangkok, Thailand, December 18--20, 2017. 64--71.
  32. Ricardo Nabinger Sanchez, José Nelson Amaral, Duane Szafron, Marius Pirvu, and Mark G. Stoodley. 2011. Using machines to learn method-specific compilation strategies. In CGO. IEEE.
  33. Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, and Richard W. Vuduc. 2012. A performance analysis framework for identifying potential benefits in GPGPU applications. In PPoPP. ACM.
  34. Michel Steuwer, Christian Fensch, Sam Lindley, and Christophe Dubach. 2015. Generating Performance Portable Code Using Rewrite Rules: From High-level Functional Expressions to High-performance OpenCL Code. In Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming (ICFP 2015). ACM.
  35. Michel Steuwer, Toomas Remmelg, and Christophe Dubach. 2017. Lift: a functional data-parallel IR for high-performance GPU code generation. In CGO. http://dl.acm.org/citation.cfm?id=3049841
  36. Kevin Stock, Louis-Noël Pouchet, and P. Sadayappan. 2012. Using machine learning to improve automatic vectorization. ACM TACO 8, 4 (2012), 50.
  37. Michele Tartara and Stefano Crespi-Reghizzi. 2013. Continuous learning of compiler heuristics. ACM TACO 9, 4 (2013), 46.
  38. Xiuxia Zhang, Guangming Tan, Shuangbai Xue, Jiajia Li, Keren Zhou, and Mingyu Chen. 2017. Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17). ACM, New York, NY, USA, 31--43.

• Published in

  GPGPU '20: Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit
  February 2020
  77 pages
  ISBN: 9781450370257
  DOI: 10.1145/3366428

  Copyright © 2020 ACM

  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

  Publisher

  Association for Computing Machinery, New York, NY, United States

  Acceptance Rates

  GPGPU '20 paper acceptance rate: 7 of 12 submissions, 58%. Overall acceptance rate: 57 of 129 submissions, 44%.
