Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF)

Authors:
Ari Rasch, University of Muenster, Germany (ORCID: 0000-0002-0286-0755)
Richard Schulze, University of Muenster, Germany
Michel Steuwer, University of Edinburgh, United Kingdom (ORCID: 0000-0001-5048-0741)
Sergei Gorlatch, University of Muenster, Germany

ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 1, Article No. 1, pp. 1–26. https://doi.org/10.1145/3427093

Published: 20 January 2021

Abstract

Auto-tuning is a popular approach to program optimization: it automatically finds good configurations of a program's so-called tuning parameters, whose values are crucial for achieving high performance on a particular parallel architecture and for particular characteristics of the input/output data. We present three new contributions of the Auto-Tuning Framework (ATF), which enable a key advantage in general-purpose auto-tuning: efficiently optimizing programs whose tuning parameters are interdependent. The contributions address the three main phases of general-purpose auto-tuning: (1) ATF generates the search space of interdependent tuning parameters with high performance by efficiently exploiting parameter constraints; (2) ATF stores such search spaces efficiently in memory, based on a novel chain-of-trees search space structure; (3) ATF explores these search spaces faster, by employing a multi-dimensional search strategy on its chain-of-trees search space representation. Our experiments demonstrate that, compared to state-of-the-art general-purpose auto-tuning frameworks, ATF substantially improves generating, storing, and exploring the search space of interdependent tuning parameters, thereby enabling an efficient overall auto-tuning process for important applications from popular domains, including stencil computations, linear algebra routines, quantum chemistry computations, and data mining algorithms.
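To make the idea of interdependent tuning parameters concrete, the following is a minimal Python sketch and not ATF's actual API or implementation. It assumes two hypothetical parameters, `tile` and `local`, with the illustrative constraints that `tile` must divide an input size `n` and `local` must divide `tile`. It contrasts enumerating the full Cartesian product and filtering afterwards with checking each constraint as soon as its parameter is chosen, and it stores the result as a small two-level tree. All function names are invented for this illustration.

```python
from itertools import product


def divides(a, b):
    """Return True if a divides b without remainder."""
    return b % a == 0


def generate_then_filter(n):
    """Baseline: enumerate the full Cartesian product, then discard invalid pairs."""
    values = range(1, n + 1)
    return [
        (tile, local)
        for tile, local in product(values, values)
        if divides(tile, n) and divides(local, tile)
    ]


def generate_with_constraints(n):
    """Constraint-driven generation (illustrative assumption, not ATF's code).

    Each constraint is checked as soon as its parameter is chosen, so invalid
    tile sizes are pruned before any local size is tried for them.
    Returns a two-level tree: each valid tile size maps to its valid local sizes.
    """
    tree = {}
    for tile in range(1, n + 1):
        if not divides(tile, n):
            continue  # prune the whole subtree below this tile size
        tree[tile] = [local for local in range(1, tile + 1) if divides(local, tile)]
    return tree


def flatten(tree):
    """List all (tile, local) configurations stored in the tree."""
    return [(tile, local) for tile, locals_ in tree.items() for local in locals_]


if __name__ == "__main__":
    tree = generate_with_constraints(12)
    print(tree)
    print(len(flatten(tree)), "valid configurations")
```

Both functions yield the same valid configurations; the constraint-driven variant simply never visits most of the invalid combinations, which is the kind of saving the first contribution in the abstract refers to.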



Published in
ACM Transactions on Architecture and Code Optimization Volume 18, Issue 1
March 2021
402 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3446348
Editor: David Kaeli, Northeastern University, USA
Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 January 2021
- Revised: 1 September 2020
- Accepted: 1 September 2020
- Received: 1 May 2020
Author Tags
Auto-tuning
interdependent tuning parameters
parallel programs
Qualifiers
- research-article
- Research
- Refereed