DOI: 10.1145/3624062.3624208
research-article

GPUscout: Locating Data Movement-related Bottlenecks on GPUs

Published: 12 November 2023

ABSTRACT

GPUs offer an attractive opportunity for delivering high-performance applications. However, GPU codes are often limited by memory contention, which degrades overall performance. Because GPU scheduling is transparent to the user and GPU memory architectures are far more complex than those of CPUs, finding such bottlenecks is a cumbersome process.

In this paper, we present GPUscout, a novel method for systematically detecting the root causes of frequent memory performance bottlenecks on NVIDIA GPUs. It connects three approaches to performance analysis: static CUDA SASS code analysis, warp stall sampling, and kernel performance metrics. By connecting these approaches, GPUscout can identify a problem, locate the code segment where it originates, and assess its importance.
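To make the first of the three approaches more concrete, the following is a minimal sketch of what static SASS analysis can look like. It is not GPUscout's implementation: the function name, the hand-written SASS excerpt, and the choice of opcodes to flag are assumptions made for illustration. It relies only on the fact that SASS uses the mnemonics LDG/STG, LDL/STL, and LDS/STS for global, local, and shared memory accesses, and that local memory accesses are often (though not always) a sign of register spilling.

```python
import re
from collections import Counter

# Hypothetical SASS excerpt written by hand for illustration. Real input
# would come from a disassembler such as cuobjdump or nvdisasm.
SAMPLE_SASS = """\
        /*0000*/                   MOV R1, c[0x0][0x28] ;
        /*0010*/                   LDG.E.SYS R4, [R2] ;
        /*0020*/                   STL [R1+0x4], R4 ;
        /*0030*/                   LDL R5, [R1+0x4] ;
        /*0040*/                   FADD R6, R5, R4 ;
        /*0050*/                   STG.E.SYS [R2], R6 ;
        /*0060*/                   EXIT ;
"""

# Base mnemonics of the memory instructions this sketch looks for.
MEMORY_OPCODES = {
    "LDG": "global load",
    "STG": "global store",
    "LDL": "local load",
    "STL": "local store",
    "LDS": "shared load",
    "STS": "shared store",
}

# Matches the instruction offset comment, an optional predicate such as
# @P0 or @!P1, and the base mnemonic (everything before the first '.').
LINE_RE = re.compile(
    r"/\*(?P<offset>[0-9a-fA-F]+)\*/\s+(?:@!?U?P\d+\s+)?(?P<op>[A-Z][A-Z0-9]*)"
)


def find_memory_instructions(sass_text):
    """Return (offset, opcode, kind) for each memory instruction found."""
    found = []
    for line in sass_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        opcode = match.group("op")
        kind = MEMORY_OPCODES.get(opcode)
        if kind is not None:
            found.append((match.group("offset"), opcode, kind))
    return found


if __name__ == "__main__":
    hits = find_memory_instructions(SAMPLE_SASS)
    for offset, opcode, kind in hits:
        print(f"{offset}: {opcode} ({kind})")
    # Summarize by kind; local loads/stores hint at possible register spills.
    print(Counter(kind for _, _, kind in hits))
```

A static pass like this can only point at candidate instructions. Deciding whether they matter requires runtime data, which is why the abstract pairs static analysis with warp stall sampling and kernel metrics.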

This paper illustrates the capabilities and design of our implementation of GPUscout. We demonstrate its applicability on three commonly used kernels, with promising results in terms of accuracy, efficiency, and usability.

Published in

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
November 2023, 2180 pages
ISBN: 9798400707858
DOI: 10.1145/3624062

Copyright © 2023 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: research-article, Research, Refereed limited