GPUscout: Locating Data Movement-related Bottlenecks on GPUs

ABSTRACT
GPUs offer an attractive path to high-performance applications. However, GPU codes are often limited by memory contention, which degrades overall performance. Because GPU scheduling is transparent to the user, and GPU memory architectures are far more complex than those of CPUs, locating such bottlenecks is a cumbersome process.
In this paper, we present GPUscout, a novel method for systematically detecting the root causes of frequent memory performance bottlenecks on NVIDIA GPUs. It combines three approaches to performance analysis: static CUDA SASS code analysis, warp stall sampling, and kernel performance metrics. By connecting these approaches, GPUscout can identify a problem, locate the code segment where it originates, and assess its importance.
This paper illustrates the design and capabilities of our GPUscout implementation. We demonstrate its applicability on three commonly used kernels, with promising results in terms of accuracy, efficiency, and usability.
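To make the class of data movement bottlenecks concrete, the following minimal CUDA sketch (not taken from the paper; kernel names and comments are illustrative assumptions) contrasts a strided global-memory access pattern, which a warp-stall-sampling analysis would typically attribute to long memory latencies on the load instruction, with its coalesced counterpart:

```cuda
// Hypothetical kernels illustrating a typical data-movement bottleneck;
// names and access patterns are illustrative, not from the paper.

// Strided reads: consecutive threads in a warp touch addresses that lie
// 'stride' floats apart, so a single warp-wide load splits into many
// memory transactions. Stall sampling would typically attribute long
// memory-latency stalls to this load, and SASS analysis would locate
// the corresponding LDG instruction.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[((size_t)i * stride) % n];
}

// Coalesced reads: the 32 loads of a warp fall on consecutive addresses
// and combine into a few wide transactions, removing the bottleneck.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```

A tool connecting static SASS analysis with stall sampling can both flag the offending load in `copy_strided` and quantify, via kernel metrics, how much of the runtime it costs.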