GPUscout: Locating Data Movement-related Bottlenecks on GPUs

ABSTRACT
GPUs offer an attractive path to high-performance applications. However, GPU codes are often limited by memory contention, which degrades overall performance. Because GPU scheduling is transparent to the user, and GPU memory architectures are far more complex than those of CPUs, locating such bottlenecks is a cumbersome process.
In this paper, we present GPUscout, a novel method for systematically detecting the root causes of frequent memory performance bottlenecks on NVIDIA GPUs. It combines three approaches to performance analysis: static CUDA SASS code analysis, warp stall sampling, and kernel performance metrics. By connecting these approaches, GPUscout can identify a problem, locate the code segment where it originates, and assess its importance.
This paper illustrates the design and capabilities of our GPUscout implementation. We demonstrate its applicability on three commonly used kernels, with promising results in terms of accuracy, efficiency, and usability.
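To make the class of data movement bottlenecks concrete, the following minimal CUDA sketch (not taken from the paper; kernel names and comments are illustrative assumptions) contrasts a strided global-memory access pattern, which a warp-stall-sampling analysis would typically attribute to long memory latencies on the load instruction, with its coalesced counterpart:

```cuda
// Hypothetical kernels illustrating a typical data-movement bottleneck;
// names and access patterns are illustrative, not from the paper.

// Strided reads: consecutive threads in a warp touch addresses that lie
// 'stride' floats apart, so a single warp-wide load splits into many
// memory transactions. Stall sampling would typically attribute long
// memory-latency stalls to this load, and SASS analysis would locate
// the corresponding LDG instruction.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[((size_t)i * stride) % n];
}

// Coalesced reads: the 32 loads of a warp fall on consecutive addresses
// and combine into a few wide transactions, removing the bottleneck.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```

A tool connecting static SASS analysis with stall sampling can both flag the offending load in `copy_strided` and quantify, via kernel metrics, how much of the runtime it costs.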