Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs
2010
2010 International Conference on Computational and Information Sciences
The sparse matrix-vector multiplication (SpMV) is a critical operation in a wide variety of scientific and engineering applications, such as sparse linear algebra and image processing. ...
This paper presents an auto-tuning framework that can automatically compute and select CUDA parameters for SpMV to obtain the optimal performance on specific GPUs. ...
Fig. 1 shows an example for a widely-used sparse matrix format called CSR (Compressed Sparse Row) or CRS (Compressed Row Storage). ...
doi:10.1109/iccis.2010.285
fatcat:tzjmvn6hmzeu3eu2wq4uz2tqgi
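The CSR/CRS layout this abstract refers to can be illustrated with a short sketch. The array names `row_ptr`, `col_idx`, and `vals` are conventional choices, not taken from the paper:

```python
# CSR (Compressed Sparse Row) stores a sparse matrix as three arrays:
#   vals    - the nonzero values, row by row
#   col_idx - the column index of each nonzero
#   row_ptr - row_ptr[i]:row_ptr[i+1] is the slice of nonzeros in row i
def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix A given in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# The 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [1.0, 2.0, 3.0, 4.0, 5.0]
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # → [3.0, 3.0, 9.0]
```

The inner loop touches `x` at irregular positions given by `col_idx`, which is exactly the scattered-access pattern that makes SpMV hard to tune on GPUs.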
Auto-SpMV: Automated Optimizing SpMV Kernels on GPU
[article]
2023
arXiv
pre-print
Sparse matrix-vector multiplication (SpMV) is an essential linear algebra operation that dominates the computing cost in many scientific applications. ...
In the compile-time mode, Auto-SpMV tweaks the compilation parameters, while in the run-time mode, Auto-SpMV selects the best sparse format for the sparse input matrix. ...
SpMV Processing on GPU: The Sparse Matrix-Vector Multiplication (SpMV) is a non-trivial Level-2 BLAS operation that finds the dense vector product Y of a sparse matrix A and a dense vector X such that Y ...
arXiv:2302.05662v1
fatcat:xe4in7wlljhr5ffoce6jzomika
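A run-time format selector of the kind this abstract describes can be sketched with a simple heuristic. The 0.5 threshold and the CSR/ELL choice below are illustrative assumptions, not Auto-SpMV's actual model:

```python
from statistics import mean, pstdev

def pick_format(rows):
    """Choose a sparse format from the row-length distribution.

    ELL pads every row out to the maximum row length, so it only pays
    off when row lengths are fairly uniform; otherwise fall back to
    CSR.  The 0.5 cutoff is an illustrative guess, not a tuned value.
    """
    lengths = [len(r) for r in rows]
    m = mean(lengths)
    if m == 0:
        return "CSR"
    return "ELL" if pstdev(lengths) / m < 0.5 else "CSR"

uniform   = [[1, 2], [3, 4], [5, 6]]       # every row has 2 nonzeros
irregular = [[1], [], [2, 3, 4, 5, 6, 7]]  # wildly varying row lengths
print(pick_format(uniform))    # → ELL
print(pick_format(irregular))  # → CSR
```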
A model-driven partitioning and auto-tuning integrated framework for sparse matrix-vector multiplication on GPUs
2011
Proceedings of the 2011 TeraGrid Conference on Extreme Digital Discovery - TG '11
Sparse Matrix-Vector Multiplication (SpMV) is very common to scientific computing. ...
This paper presents an innovative performance-model driven approach for partitioning sparse matrix into appropriate formats, and auto-tuning configurations of CUDA kernels to improve the performance of ...
For CSR vector kernel, one warp (32 threads) is responsible for computing the multiplication of one row of the matrix and the vector. ...
doi:10.1145/2016741.2016744
fatcat:yukl7bdijjfc7gqlrx5sv2k7wy
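The warp-per-row CSR "vector" kernel described above can be mimicked serially: 32 lanes each take every 32nd nonzero of a row, and the per-lane partial sums are reduced, which is what one warp does on the GPU. This is a plain-Python sketch, not CUDA:

```python
WARP_SIZE = 32  # lanes in one NVIDIA warp

def csr_vector_row(row_vals, row_cols, x):
    """One row of SpMV, organized the way a CSR vector kernel does it:
    lane j of the warp handles nonzeros j, j+32, j+64, ..., and the
    per-lane partial sums are then reduced to a single value."""
    partial = [0.0] * WARP_SIZE
    for lane in range(WARP_SIZE):
        for k in range(lane, len(row_vals), WARP_SIZE):
            partial[lane] += row_vals[k] * x[row_cols[k]]
    return sum(partial)  # on the GPU: a warp-level shuffle reduction

# Row with 5 nonzeros at columns 0..4, multiplied by the all-ones vector:
print(csr_vector_row([1.0, 2.0, 3.0, 4.0, 5.0], [0, 1, 2, 3, 4],
                     [1.0] * 5))  # → 15.0
```

The stride-32 access means consecutive lanes read consecutive nonzeros, giving coalesced loads; rows much shorter than 32 leave most lanes idle, which is why such frameworks tune the kernel choice per matrix.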
Performance Enhancement Strategies for Sparse Matrix-Vector Multiplication (SpMV) and Iterative Linear Solvers
[article]
2022
arXiv
pre-print
The crucial computing kernel for such iterative solutions is the multiplication of a sparse matrix by a dense vector. ...
Efficient implementations of sparse matrix-vector multiplication (SpMV) and linear solvers are therefore essential and have been the subject of extensive research across a variety of computing architectures ...
Acknowledgement The authors acknowledge with thanks the technical and financial support from the Deanship of Scientific Research (DSR) at the King Abdulaziz University (KAU), Jeddah, Saudi Arabia, under ...
arXiv:2212.07490v1
fatcat:3jtf7lhjtbgdhnozrgx3k27wiu
Implementing Sparse Matrix-Vector Multiplication with QCSR on GPU
2013
Applied Mathematics & Information Sciences
Various data formats to store the sparse matrix have been implemented on GPUs to maximize the performance. ...
However, it is still challenging to optimize computations with irregular data access patterns like sparse matrix-vector multiplication (SPMV). ...
Some of the first work on sparse matrix-vector multiplication on GPU architectures was by Bolz et al [26] . ...
doi:10.12785/amis/070207
fatcat:okhgxfolzjbp5mb4tevbzf25bq
GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems
2016
International journal of parallel programming
The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems ...
GPUs, and the Intel Xeon Phi. ...
Special thanks go to Andreas Alvermann for providing sparse matrix generation functions for testing and everyone else who contributed to GHOST, directly or indirectly. ...
doi:10.1007/s10766-016-0464-z
fatcat:w64ypr4otfgcxbfftzzuko5nim
Exact sparse matrix-vector multiplication on GPU's and multicore architectures
2010
Proceedings of the 4th International Workshop on Parallel and Symbolic Computation - PASCO '10
We propose different implementations of the sparse matrix--dense vector multiplication (SpMV) for finite fields and rings Z/mZ. ...
We take advantage of graphic card processors (GPU) and multi-core architectures. Our aim is to improve the speed of SpMV in the library, and henceforth the speed of its black box algorithms. ...
On the numerical side, several libraries automatically tune the sparse matrix kernels [19, 20, 16] and recently some kernels have been proposed e.g. for GPUs [17, 2, 1] . ...
doi:10.1145/1837210.1837224
dblp:conf/cap/BoyerDG10
fatcat:kpbem74jmre2lkrfutypof7lxi
The challenges of writing portable, correct and high performance libraries for GPUs
2011
SIGARCH Computer Architecture News
Specifically we target the linear solver module, including Conjugate Gradient, Jacobi and MinRes solvers for sparse matrices. ...
This problem requires a different set of tradeoffs than finding the best runtime for a single solution. ...
We thank SCI Institute personnel Jeroen Stinstra, Darrell Swenson (for guidance and test problems), and especially Ayla Khan (for help integrating our work with the SCI tools). ...
doi:10.1145/2082156.2082158
pmid:23807820
pmcid:PMC3691863
fatcat:lk22rfzsv5cozny52kuspmldqm
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
[chapter]
2014
Lecture Notes in Computer Science
The computation, known as the sparse-matrix vector multiplication (SpMV), and with some variants, such as the sparse-matrix matrix multiplication (SpMM), they form the computational core of many applications ...
The core of many scientific applications involves the multiplication of a large, sparse matrix with a single or multiple dense vectors which are not compute-bound but memory-bound. ...
An m×n sparse matrix A with τ nonzeros is usually stored in the compressed row storage format (CRS), which uses three arrays: cids[.], an integer array of size τ that stores the column ids for each nonzero ...
doi:10.1007/978-3-642-55224-3_52
fatcat:vwfpsy3ur5ahznfdoicggp26dq
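The CRS arrays described in this abstract can be built from coordinate triples in a few lines. Only `cids` is named in the snippet; `vals` and `rptr` are assumed names for the other two arrays:

```python
def coo_to_crs(triples, n_rows):
    """Build CRS arrays from (row, col, value) triples.

    Returns (rptr, cids, vals): cids and vals both have length tau
    (the nonzero count), and rptr has length n_rows + 1 and delimits
    each row's slice of the other two arrays.
    """
    triples = sorted(triples)            # row-major order
    cids = [c for _, c, _ in triples]
    vals = [v for _, _, v in triples]
    rptr = [0] * (n_rows + 1)
    for r, _, _ in triples:              # count nonzeros per row...
        rptr[r + 1] += 1
    for i in range(n_rows):              # ...then prefix-sum into offsets
        rptr[i + 1] += rptr[i]
    return rptr, cids, vals

triples = [(0, 0, 1.0), (0, 2, 2.0), (2, 1, 3.0)]
print(coo_to_crs(triples, 3))  # → ([0, 2, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
```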
RT-CUDA: A Software Tool for CUDA Code Restructuring
2016
International journal of parallel programming
A blocked sparse matrix-vector multiplication for NVIDIA GPUs [10] has been implemented. ...
An optimized version of Sparse Matrix Vector (SpMV) Multiplication [27] has been implemented on NVIDIA GPUs using CUDA. ...
The modification to the code has no effect on the parameters for the Jacobi solver ...
doi:10.1007/s10766-016-0433-6
fatcat:xxikpkyrkvdizgijskmk2qxkay
Software for Sparse Tensor Decomposition on Emerging Computing Architectures
[article]
2019
arXiv
pre-print
In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures. ...
Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. ...
We thank the anonymous referees for critical feedback on this work, resulting in substantial improvements in the presentation. ...
arXiv:1809.09175v2
fatcat:eflkr3ocmjgtrfizp6gn4jti6y
Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
2015
2015 IEEE International Parallel and Distributed Processing Symposium
To alleviate the effects of scattered data access we combine loosely coupled outer iterations with tightly coupled block sparse matrix multiple vector operations, which enables pure data streaming. ...
At the node level we show that it is possible to decouple the sparse matrix problem posed by KPM from main memory bandwidth both on CPU and GPU. ...
ACKNOWLEDGMENTS We are indebted to the Swiss National Computing Centre for granting access to Piz Daint. ...
doi:10.1109/ipdps.2015.76
dblp:conf/ipps/KreutzerPHWAF15
fatcat:afbnmvlqlrgb5iia35rwupxmma
DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions
[chapter]
2020
Lecture Notes in Computer Science
This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA's graphics processing units (GPUs). ...
The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation for matrix multiplication. ...
This research was partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286 and MEXT as "Exploratory Issue on Post-K computer" (Development of verified ...
doi:10.1007/978-3-030-50743-5_12
fatcat:wanjoziydvf4hlmz4gjk7om3aq
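The error-free transformation that the Ozaki scheme builds on can be sketched with classic Veltkamp/Dekker splitting; this is the textbook formulation, not the paper's Tensor-Core implementation:

```python
from fractions import Fraction

def split(a):
    """Veltkamp splitting: a == hi + lo, each half fitting in ~26 bits."""
    c = 134217729.0 * a        # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Error-free transformation of a product: returns (p, e) with
    a*b == p + e exactly (barring overflow/underflow), where p is
    the rounded product and e is the rounding error."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

p, e = two_prod(0.1, 0.3)
# p + e reproduces the exact real product of the two floats:
assert Fraction(p) + Fraction(e) == Fraction(0.1) * Fraction(0.3)
```

Splitting each operand into halves short enough that their pairwise products are exact is the same idea the Ozaki scheme applies to whole matrices, so that lower-precision multiplies can be summed back into a high-accuracy result.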
Kokkos Kernels: Performance Portable Sparse/Dense Linear Algebra and Graph Kernels
[article]
2021
arXiv
pre-print
We describe Kokkos Kernels, a library of kernels for sparse linear algebra, dense linear algebra and graph kernels. ...
Specifically, we demonstrate the performance of four sparse kernels, three dense batched kernels, two graph kernels and one team level algorithm. ...
ACKNOWLEDGMENTS We would like to acknowledge the contributions of several "alumni" and friends of the Kokkos Kernels library ...
arXiv:2103.11991v1
fatcat:m7iskgt5kjdjnex7lenqsjj6z4
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA
[article]
2017
arXiv
pre-print
Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. ...
a full LSTM for speech recognition with a power dissipation of 41 Watts. ...
This work was supported by National Natural Science Foundation of China (No.61373026, 61622403, 61261160501). ...
arXiv:1612.00694v2
fatcat:a65wy2piqnezjjjmauub7rdlai
Showing results 1 — 15 out of 417 results