Abstract
Auto-tuning is a popular approach to program optimization: it automatically finds good configurations of a program’s so-called tuning parameters, whose values are crucial for achieving high performance on a particular parallel architecture and for particular characteristics of the input/output data. We present three new contributions to the Auto-Tuning Framework (ATF) that enable a key advantage in general-purpose auto-tuning: efficiently optimizing programs whose tuning parameters are interdependent. The contributions address the three main phases of general-purpose auto-tuning: (1) ATF generates the search space of interdependent tuning parameters with high performance by exploiting parameter constraints; (2) ATF stores such search spaces efficiently in memory, based on a novel chain-of-trees search space structure; (3) ATF explores these search spaces faster by employing a multi-dimensional search strategy on its chain-of-trees search space representation. Our experiments demonstrate that, compared to state-of-the-art general-purpose auto-tuning frameworks, ATF substantially improves generating, storing, and exploring the search space of interdependent tuning parameters, thereby enabling an efficient overall auto-tuning process for important applications from popular domains, including stencil computations, linear algebra routines, quantum chemistry computations, and data mining algorithms.
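The difference between filtering a full search space and exploiting constraints during generation can be illustrated with a small sketch. This is not ATF's API or implementation. The parameter names (tile, local), the divisibility constraints, and the helper functions are hypothetical, chosen only to show the general idea of interdependent parameters, and the tree and chain shown here are a simplified reading of the chain-of-trees structure named in the abstract.

```python
import math
from itertools import product


def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def generate_filtered(n):
    """Naive approach: enumerate the full Cartesian product, then filter."""
    values = range(1, n + 1)
    return [(tile, local)
            for tile, local in product(values, values)
            if n % tile == 0 and tile % local == 0]


def generate_constrained(n):
    """Constraint-aware approach: each range depends on earlier choices."""
    space = []
    for tile in divisors(n):          # assumed constraint: tile divides n
        for local in divisors(tile):  # assumed constraint: local divides tile
            space.append((tile, local))
    return space


def build_tree(n):
    """Store one group of interdependent parameters as a two-level tree.

    Each key is a valid tile value; its list holds the local values
    that are valid for that tile.
    """
    return {tile: divisors(tile) for tile in divisors(n)}


def tree_size(tree):
    """Number of valid configurations (leaves) stored in one tree."""
    return sum(len(children) for children in tree.values())


def chain_size(chain):
    """Independent groups multiply, so the chain never stores their product."""
    return math.prod(tree_size(tree) for tree in chain)


if __name__ == "__main__":
    n = 8
    assert generate_constrained(n) == generate_filtered(n)
    print(len(generate_constrained(n)), "valid configurations out of", n * n)

    # Two independent parameter groups, each stored as its own tree.
    chain = [build_tree(8), build_tree(4)]
    print("configurations represented by the chain:", chain_size(chain))
```

In this sketch the filtered variant visits every pair in the Cartesian product, while the constrained variant only ever visits valid pairs. Keeping independent parameter groups as separate trees means the combined space is represented without materializing it.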
Index Terms
- Efficient Auto-Tuning of Parallel Programs with Interdependent Tuning Parameters via Auto-Tuning Framework (ATF)