ABSTRACT
High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as Lift, Futhark or Delite have shown that it is possible to have both high-level abstractions and performance, even for HPC workloads such as stencils. In addition, these high-level functional abstractions can also be used to represent programs and their optimized variants within the compiler itself. However, such high-level approaches rely heavily on the compiler to optimize programs, which is notoriously hard when targeting GPUs.
Compilers either use hand-crafted heuristics to direct the optimizations or iterative compilation to search the optimization space. The first approach yields fast compile times, but it is not performance-portable across different devices and requires significant human effort to build the heuristics. Iterative compilation, on the other hand, searches the optimization space automatically and adapts to different devices. However, this process is often very time-consuming, as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speed up the exploration of the optimization space. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features.
Using the Lift framework, this paper demonstrates how low-level, GPU-specific features can be extracted directly from a high-level functional representation. The Lift IR (Intermediate Representation) is in fact a very suitable choice, since all optimization choices are exposed at the IR level. This paper shows how to extract low-level features such as the number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models. Using this approach, we are able to speed up the exploration of the space by a factor of 2000x on an AMD GPU and 450x on an Nvidia GPU, on average across many stencil applications.
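To make the "unique cache lines accessed per warp" feature concrete, the sketch below counts, for each warp, how many distinct cache lines a per-thread access pattern touches. This is only an illustration of the metric itself, not the paper's actual analysis (which derives such features statically from the Lift IR); the warp size, cache-line size, and example access patterns are assumptions chosen for clarity.

```python
# Illustrative sketch: the "unique cache lines per warp" metric.
# Assumed hardware parameters (typical, not taken from the paper):
WARP_SIZE = 32          # threads per warp
CACHE_LINE_BYTES = 128  # cache line granularity
ELEMENT_BYTES = 4       # size of one float element

def unique_cache_lines_per_warp(addresses):
    """Count the distinct cache lines touched by each warp.

    `addresses` is a list of byte addresses, one per thread, ordered
    by global thread id. Returns one count per 32-thread warp.
    """
    counts = []
    for w in range(0, len(addresses), WARP_SIZE):
        warp = addresses[w:w + WARP_SIZE]
        # Two addresses share a cache line iff they have the same line index.
        lines = {addr // CACHE_LINE_BYTES for addr in warp}
        counts.append(len(lines))
    return counts

# Coalesced pattern: consecutive threads read consecutive floats,
# so each warp covers a single 128-byte line.
coalesced = [tid * ELEMENT_BYTES for tid in range(64)]
print(unique_cache_lines_per_warp(coalesced))  # [1, 1]

# Strided pattern: consecutive threads are 32 floats (128 bytes) apart,
# so every thread in a warp hits its own line.
strided = [tid * 32 * ELEMENT_BYTES for tid in range(64)]
print(unique_cache_lines_per_warp(strided))    # [32, 32]
```

A low count per warp indicates coalesced accesses and high memory-bandwidth utilization, while a count near the warp size signals scattered accesses, which is why this feature is so predictive of GPU kernel performance.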
High-level hardware feature extraction for GPU performance prediction of stencils