417 Hits in 4.9 sec

Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs

Ping Guo, Liqiang Wang
2010 2010 International Conference on Computational and Information Sciences  
The sparse matrix-vector multiplication (SpMV) is a critical operation in a wide variety of scientific and engineering applications, such as sparse linear algebra and image processing.  ...  This paper presents an auto-tuning framework that can automatically compute and select CUDA parameters for SpMV to obtain optimal performance on specific GPUs.  ...  Fig. 1 shows an example of a widely used sparse matrix format called CSR (Compressed Sparse Row) or CRS (Compressed Row Storage).  ...  (A minimal CSR layout sketch follows this entry.)
doi:10.1109/iccis.2010.285 fatcat:tzjmvn6hmzeu3eu2wq4uz2tqgi
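
To make the excerpt's CSR description concrete, here is a minimal sketch of the layout on a small matrix. The 4×4 matrix and the array names (values, col_idx, row_ptr) are our own illustration and do not come from the paper.

```cuda
// CSR stores only the nonzeros of a sparse matrix, row by row:
//   values  - the nonzero entries
//   col_idx - the column of each nonzero
//   row_ptr - where each row's nonzeros begin (n_rows + 1 entries)
// Example matrix (invented for illustration):
//   [1 0 0 2]
//   [0 3 0 0]
//   [0 4 5 0]
//   [0 0 0 6]
#include <cstdio>

int main() {
    const float values[]  = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
    const int   col_idx[] = {0, 3, 1, 1, 2, 3};
    const int   row_ptr[] = {0, 2, 3, 5, 6};

    // Walk the structure and print every stored entry.
    for (int row = 0; row < 4; ++row)
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            printf("A[%d][%d] = %g\n", row, col_idx[j], values[j]);
    return 0;
}
```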

Auto-SpMV: Automated Optimizing SpMV Kernels on GPU [article]

Mina Ashoury and Mohammad Loni and Farshad Khunjush and Masoud Daneshtalab
2023 arXiv   pre-print
Sparse matrix-vector multiplication (SpMV) is an essential linear algebra operation that dominates the computing cost in many scientific applications.  ...  In the compile-time mode, Auto-SpMV tweaks the compilation parameters, while in the run-time mode, Auto-SpMV selects the best sparse format for the sparse input matrix.  ...  SpMV Processing on GPU: Sparse matrix-vector multiplication (SpMV) is a non-trivial Level-2 BLAS operation that finds the dense vector product Y of a sparse matrix A and a dense vector X such that Y = AX  ...  (A sketch of the baseline CSR kernel follows this entry.)
arXiv:2302.05662v1 fatcat:xe4in7wlljhr5ffoce6jzomika
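
The excerpt defines SpMV as computing Y = AX for a sparse A. The simplest CUDA realization is the CSR "scalar" kernel, one thread per row; kernels of this kind, together with their launch parameters and the choice of storage format, are what frameworks like Auto-SpMV select among. The code below is a generic sketch, not code from the paper.

```cuda
// Baseline CSR "scalar" SpMV kernel: thread i computes y[i] = (A x)[i].
// Launch with enough threads to cover n_rows, e.g.
//   spmv_csr_scalar<<<(n_rows + 255) / 256, 256>>>(...);
__global__ void spmv_csr_scalar(int n_rows,
                                const int *row_ptr, const int *col_idx,
                                const float *values,
                                const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float dot = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            dot += values[j] * x[col_idx[j]];
        y[row] = dot;
    }
}
```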

A model-driven partitioning and auto-tuning integrated framework for sparse matrix-vector multiplication on GPUs

Ping Guo, He Huang, Qichang Chen, Liqiang Wang, En-Jui Lee, Po Chen
2011 Proceedings of the 2011 TeraGrid Conference on Extreme Digital Discovery - TG '11  
Sparse Matrix-Vector Multiplication (SpMV) is very common in scientific computing.  ...  This paper presents an innovative performance-model-driven approach for partitioning a sparse matrix into appropriate formats, and auto-tuning configurations of CUDA kernels, to improve the performance of  ...  For the CSR vector kernel, one warp (32 threads) is responsible for computing the multiplication of one row of the matrix and the vector.  ...  (A warp-per-row kernel sketch follows this entry.)
doi:10.1145/2016741.2016744 fatcat:yukl7bdijjfc7gqlrx5sv2k7wy
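
The excerpt describes the CSR "vector" kernel, where one 32-thread warp handles one matrix row. Below is a sketch of that idea using a warp-shuffle reduction; 2011-era kernels of this kind reduced through shared memory instead, so treat this as a modernized illustration, not the paper's code.

```cuda
// CSR "vector" SpMV kernel: one warp (32 threads) per row.
// Assumes blockDim.x is a multiple of 32.
__global__ void spmv_csr_vector(int n_rows,
                                const int *row_ptr, const int *col_idx,
                                const float *values,
                                const float *x, float *y) {
    int lane = threadIdx.x & 31;                              // lane within the warp
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;  // global warp id
    if (row < n_rows) {
        float sum = 0.0f;
        // The 32 lanes stride jointly through this row's nonzeros,
        // so the loads of values[] and col_idx[] are coalesced.
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
            sum += values[j] * x[col_idx[j]];
        // Butterfly reduction of the 32 partial sums within the warp.
        for (int off = 16; off > 0; off >>= 1)
            sum += __shfl_down_sync(0xffffffffu, sum, off);
        if (lane == 0) y[row] = sum;
    }
}
```

The scalar kernel wins on short rows (no idle lanes), the vector kernel on long rows (coalesced access); that trade-off is exactly what such model-driven partitioning frameworks capture.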

Performance Enhancement Strategies for Sparse Matrix-Vector Multiplication (SpMV) and Iterative Linear Solvers [article]

Thaha Mohammed, Rashid Mehmood
2022 arXiv   pre-print
The crucial computing kernel for such iterative solutions is the multiplication of a sparse matrix by a dense vector.  ...  Efficient implementations of sparse matrix-vector multiplication (SpMV) and linear solvers are therefore essential and have been the subject of extensive research across a variety of computing architectures  ...  Acknowledgement: The authors acknowledge with thanks the technical and financial support from the Deanship of Scientific Research (DSR) at King Abdulaziz University (KAU), Jeddah, Saudi Arabia, under  ...  (A schematic CG solver loop follows this entry.)
arXiv:2212.07490v1 fatcat:3jtf7lhjtbgdhnozrgx3k27wiu
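
To illustrate why SpMV is "the crucial computing kernel" of iterative solvers, here is a schematic host-side conjugate gradient: every iteration costs exactly one SpMV plus a handful of dot products and vector updates. This is textbook CG under the assumption of a symmetric positive definite CSR matrix, not code from the survey.

```cuda
// Host-side CG sketch: A is SPD in CSR form; x must enter zeroed.
#include <cmath>
#include <vector>

static void spmv_csr(int n, const std::vector<int> &rp,
                     const std::vector<int> &ci, const std::vector<float> &val,
                     const std::vector<float> &x, std::vector<float> &y) {
    for (int i = 0; i < n; ++i) {
        float s = 0.0f;
        for (int j = rp[i]; j < rp[i + 1]; ++j) s += val[j] * x[ci[j]];
        y[i] = s;
    }
}

static float dot(const std::vector<float> &a, const std::vector<float> &b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

void cg(int n, const std::vector<int> &rp, const std::vector<int> &ci,
        const std::vector<float> &val, const std::vector<float> &b,
        std::vector<float> &x, int max_it, float tol) {
    std::vector<float> r = b, p = b, Ap(n);      // residual, search dir (x0 = 0)
    float rr = dot(r, r);
    for (int it = 0; it < max_it && std::sqrt(rr) > tol; ++it) {
        spmv_csr(n, rp, ci, val, p, Ap);         // the dominant kernel
        float alpha = rr / dot(p, Ap);
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        float rr_new = dot(r, r);
        for (int i = 0; i < n; ++i) p[i] = r[i] + (rr_new / rr) * p[i];
        rr = rr_new;
    }
}
```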

Implementing Sparse Matrix-Vector Multiplication with QCSR on GPU

Jilin Zhang, Enyi Liu, Jian Wan, Yongjian Ren, Miao Yue, Jue Wang
2013 Applied Mathematics & Information Sciences  
Various data formats for storing the sparse matrix have been implemented on GPUs to maximize performance.  ...  However, it is still challenging to optimize computations with irregular data access patterns, such as sparse matrix-vector multiplication (SpMV).  ...  Some of the first work on sparse matrix-vector multiplication on GPU architectures was by Bolz et al. [26].  ... 
doi:10.12785/amis/070207 fatcat:okhgxfolzjbp5mb4tevbzf25bq

GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems

Moritz Kreutzer, Jonas Thies, Melven Röhrig-Zöllner, Andreas Pieper, Faisal Shahzad, Martin Galgon, Achim Basermann, Holger Fehske, Georg Hager, Gerhard Wellein
2016 International journal of parallel programming  
The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems  ...  GPUs, and the Intel Xeon Phi.  ...  Special thanks go to Andreas Alvermann for providing sparse matrix generation functions for testing and everyone else who contributed to GHOST, directly or indirectly.  ... 
doi:10.1007/s10766-016-0464-z fatcat:w64ypr4otfgcxbfftzzuko5nim

Exact sparse matrix-vector multiplication on GPU's and multicore architectures

Brice Boyer, Jean-Guillaume Dumas, Pascal Giorgi
2010 Proceedings of the 4th International Workshop on Parallel and Symbolic Computation - PASCO '10  
We propose different implementations of the sparse matrix-dense vector multiplication (SpMV) for finite fields and rings Z/mZ.  ...  We take advantage of graphics card processors (GPUs) and multi-core architectures. Our aim is to improve the speed of SpMV in the LinBox library, and thereby the speed of its black-box algorithms.  ...  On the numerical side, several libraries automatically tune the sparse matrix kernels [19, 20, 16] and recently some kernels have been proposed, e.g., for GPUs [17, 2, 1].  ...  (A modular-arithmetic SpMV sketch follows this entry.)
doi:10.1145/1837210.1837224 dblp:conf/cap/BoyerDG10 fatcat:kpbem74jmre2lkrfutypof7lxi
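
Exact SpMV over Z/mZ replaces floating-point accumulation with integer arithmetic and modular reduction. The sketch below shows the standard "delayed reduction" trick: accumulate 32-bit entries in a 64-bit register and reduce modulo m only when the accumulator could otherwise overflow. It assumes m < 2^31 and inputs already reduced mod m; it is our illustration of the idea, not the paper's LinBox code.

```cuda
// Exact CSR SpMV over Z/mZ with delayed modular reduction.
// Assumes m < 2^31 and values[], x[] already in [0, m).
#include <cstdint>

__global__ void spmv_csr_mod(int n_rows, uint32_t m,
                             const int *row_ptr, const int *col_idx,
                             const uint32_t *values,
                             const uint32_t *x, uint32_t *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        uint64_t acc = 0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j) {
            // Each product is < 2^62, so the 64-bit sum cannot wrap as
            // long as acc stays below 2^63; reduce once it crosses that.
            acc += (uint64_t)values[j] * x[col_idx[j]];
            if (acc >> 63) acc %= m;
        }
        y[row] = (uint32_t)(acc % m);
    }
}
```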

The challenges of writing portable, correct and high performance libraries for GPUs

Miriam Leeser, Devon Yablonski, Dana Brooks, Laurie Smith King
2011 SIGARCH Computer Architecture News  
Specifically we target the linear solver module, including Conjugate Gradient, Jacobi and MinRes solvers for sparse matrices.  ...  This problem requires a different set of tradeoffs than finding the best runtime for a single solution.  ...  We thank SCI Institute personnel Jeroen Stinstra, Darrell Swenson (for guidance and test problems), and especially Ayla Khan (for help integrating our work with the SCI tools).  ... 
doi:10.1145/2082156.2082158 pmid:23807820 pmcid:PMC3691863 fatcat:lk22rfzsv5cozny52kuspmldqm

Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi [chapter]

Erik Saule, Kamer Kaya, Ümit V. Çatalyürek
2014 Lecture Notes in Computer Science  
The computation, known as sparse matrix-vector multiplication (SpMV), together with some variants, such as sparse matrix-matrix multiplication (SpMM), forms the computational core of many applications  ...  The core of many scientific applications involves the multiplication of a large, sparse matrix with one or multiple dense vectors, an operation that is memory-bound rather than compute-bound.  ...  An m×n sparse matrix A with τ nonzeros is usually stored in the compressed row storage (CRS) format, which uses three arrays: cids[·] is an integer array of size τ that stores the column ids of the nonzeros  ...  (An SpMM kernel sketch follows this entry.)
doi:10.1007/978-3-642-55224-3_52 fatcat:vwfpsy3ur5ahznfdoicggp26dq
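
The excerpt's point about SpMM (a sparse matrix times a block of dense vectors) is that each stored nonzero of A, once loaded, can serve all k right-hand sides, which raises the arithmetic intensity well above k independent SpMVs. A minimal sketch with our own layout conventions (X and Y stored row-major, one block per row, one thread per vector, assuming k ≤ 1024):

```cuda
// SpMM sketch: Y = A * X for k dense right-hand sides at once.
// X[col * k + v] and Y[row * k + v] hold vector v's entries.
// Launch as: spmm_csr<<<n_rows, k>>>(...);
__global__ void spmm_csr(int n_rows, int k,
                         const int *row_ptr, const int *col_idx,
                         const float *values,
                         const float *X, float *Y) {
    int row = blockIdx.x;     // one thread block per matrix row
    int v   = threadIdx.x;    // one thread per right-hand side
    if (row < n_rows && v < k) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            // values[j] and col_idx[j] are the same for all k threads,
            // so the matrix data is read once per nonzero and reused.
            sum += values[j] * X[col_idx[j] * k + v];
        Y[row * k + v] = sum;
    }
}
```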

RT-CUDA: A Software Tool for CUDA Code Restructuring

Ayaz H. Khan, Mayez Al-Mouhamed, Muhammed Al-Mulhem, Adel F. Ahmed
2016 International journal of parallel programming  
A blocked sparse matrix-vector multiplication for NVIDIA GPUs [10] has been implemented.  ...  An optimized version of sparse matrix-vector (SpMV) multiplication [27] has been implemented on NVIDIA GPUs using CUDA.  ...  The modification of the code has no effect on the parameters of the Jacobi solver  ... 
doi:10.1007/s10766-016-0433-6 fatcat:xxikpkyrkvdizgijskmk2qxkay

Software for Sparse Tensor Decomposition on Emerging Computing Architectures [article]

Eric Phipps, Tamara G. Kolda
2019 arXiv   pre-print
In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures.  ...  Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes.  ...  We thank the anonymous referees for critical feedback on this work, resulting in substantial improvements in the presentation.  ... 
arXiv:1809.09175v2 fatcat:eflkr3ocmjgtrfizp6gn4jti6y

Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems

Moritz Kreutzer, Andreas Pieper, Georg Hager, Gerhard Wellein, Andreas Alvermann, Holger Fehske
2015 2015 IEEE International Parallel and Distributed Processing Symposium  
To alleviate the effects of scattered data access we combine loosely coupled outer iterations with tightly coupled block sparse matrix multiple vector operations, which enables pure data streaming.  ...  At the node level we show that it is possible to decouple the sparse matrix problem posed by KPM from main memory bandwidth both on CPU and GPU.  ...  ACKNOWLEDGMENTS We are indebted to the Swiss National Computing Centre for granting access to Piz Daint.  ... 
doi:10.1109/ipdps.2015.76 dblp:conf/ipps/KreutzerPHWAF15 fatcat:afbnmvlqlrgb5iia35rwupxmma

DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions [chapter]

Daichi Mukunoki, Katsuhisa Ozaki, Takeshi Ogita, Toshiyuki Imamura
2020 Lecture Notes in Computer Science  
This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA's graphics processing units (GPUs).  ...  The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation for matrix multiplication.  ...  This research was partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286 and MEXT as "Exploratory Issue on Post-K computer" (Development of verified  ...  (An error-free splitting sketch follows this entry.)
doi:10.1007/978-3-030-50743-5_12 fatcat:wanjoziydvf4hlmz4gjk7om3aq
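
The "error-free transformation" the excerpt mentions rests on splitting each floating-point operand so that partial products carry no rounding error. The host-side sketch below demonstrates the principle on scalars with Dekker's split; the Ozaki scheme applies the same idea slice-wise to whole matrices so the partial products fit Tensor Core inputs. It assumes the compiler does not contract the expressions into FMAs.

```cuda
// Error-free scalar product via Dekker splitting: a*b is recovered
// exactly as the rounded product p plus a remainder err.
#include <cstdio>

int main() {
    double a = 1.0 / 3.0, b = 1.0 / 7.0;

    // Split each operand into high/low parts of <= 26 significand
    // bits, so every pairwise product below is exact in FP64.
    const double c = 0x1p27 + 1.0;   // splitting constant 2^27 + 1
    double t    = c * a;
    double a_hi = t - (t - a), a_lo = a - a_hi;
    t           = c * b;
    double b_hi = t - (t - b), b_lo = b - b_hi;

    double p   = a * b;              // rounded product
    double err = a_hi * b_hi - p + a_hi * b_lo + a_lo * b_hi + a_lo * b_lo;
    printf("p = %.17g  err = %.17g  (p + err equals a*b exactly)\n", p, err);
    return 0;
}
```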

Kokkos Kernels: Performance Portable Sparse/Dense Linear Algebra and Graph Kernels [article]

Sivasankaran Rajamanickam, Seher Acer, Luc Berger-Vergiat, Vinh Dang, Nathan Ellingwood, Evan Harvey, Brian Kelley, Christian R. Trott, Jeremiah Wilke, Ichitaro Yamazaki
2021 arXiv   pre-print
We describe Kokkos Kernels, a library of kernels for sparse linear algebra, dense linear algebra and graph kernels.  ...  Specifically, we demonstrate the performance of four sparse kernels, three dense batched kernels, two graph kernels and one team level algorithm.  ...  ACKNOWLEDGMENTS We would like to acknowledge the contributions of several "alumni" and friends of the Kokkos Kernels library  ... 
arXiv:2103.11991v1 fatcat:m7iskgt5kjdjnex7lenqsjj6z4

ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA [article]

Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, William J. Dally
2017 arXiv   pre-print
Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations.  ...  a full LSTM for speech recognition with a power dissipation of 41 Watts.  ...  This work was supported by National Natural Science Foundation of China (No.61373026, 61622403, 61261160501).  ... 
arXiv:1612.00694v2 fatcat:a65wy2piqnezjjjmauub7rdlai
Showing results 1 — 15 out of 417 results