Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs
2010
2010 International Conference on Computational and Information Sciences
The sparse matrix-vector multiplication (SpMV) is a critical operation in a wide variety of scientific and engineering applications, such as sparse linear algebra and image processing. ...
This paper presents an auto-tuning framework that can automatically compute and select CUDA parameters for SpMV to obtain the optimal performance on specific GPUs. ...
Fig. 1 shows an example for a widely-used sparse matrix format called CSR (Compressed Sparse Row) or CRS (Compressed Row Storage). ...
doi:10.1109/iccis.2010.285
fatcat:tzjmvn6hmzeu3eu2wq4uz2tqgi
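The CSR/CRS layout this abstract refers to can be illustrated with a short sketch. The array names `row_ptr`, `col_idx`, and `vals` are conventional choices, not taken from the paper:

```python
# CSR (Compressed Sparse Row) stores a sparse matrix as three arrays:
#   vals    - the nonzero values, row by row
#   col_idx - the column index of each nonzero
#   row_ptr - row_ptr[i]:row_ptr[i+1] is the slice of nonzeros in row i
def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix A given in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# The 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [1.0, 2.0, 3.0, 4.0, 5.0]
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # → [3.0, 3.0, 9.0]
```

The inner loop touches `x` at irregular positions given by `col_idx`, which is exactly the scattered-access pattern that makes SpMV hard to tune on GPUs.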
Auto-SpMV: Automated Optimizing SpMV Kernels on GPU
[article]
2023
arXiv
pre-print
Sparse matrix-vector multiplication (SpMV) is an essential linear algebra operation that dominates the computing cost in many scientific applications. ...
In the compile-time mode, Auto-SpMV tweaks the compilation parameters, while in the run-time mode, Auto-SpMV selects the best sparse format for the sparse input matrix. ...
SpMV Processing on GPU: The Sparse Matrix-Vector Multiplication (SpMV) is a non-trivial Level-2 BLAS operation that finds the dense vector product Y of a sparse matrix A and a dense vector X such that Y ...
arXiv:2302.05662v1
fatcat:xe4in7wlljhr5ffoce6jzomika
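A run-time format selector of the kind this abstract describes can be sketched with a simple heuristic. The 0.5 threshold and the CSR/ELL choice below are illustrative assumptions, not Auto-SpMV's actual model:

```python
from statistics import mean, pstdev

def pick_format(rows):
    """Choose a sparse format from the row-length distribution.

    ELL pads every row out to the maximum row length, so it only pays
    off when row lengths are fairly uniform; otherwise fall back to
    CSR.  The 0.5 cutoff is an illustrative guess, not a tuned value.
    """
    lengths = [len(r) for r in rows]
    m = mean(lengths)
    if m == 0:
        return "CSR"
    return "ELL" if pstdev(lengths) / m < 0.5 else "CSR"

uniform   = [[1, 2], [3, 4], [5, 6]]       # every row has 2 nonzeros
irregular = [[1], [], [2, 3, 4, 5, 6, 7]]  # wildly varying row lengths
print(pick_format(uniform))    # → ELL
print(pick_format(irregular))  # → CSR
```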
A model-driven partitioning and auto-tuning integrated framework for sparse matrix-vector multiplication on GPUs
2011
Proceedings of the 2011 TeraGrid Conference on Extreme Digital Discovery - TG '11
Sparse Matrix-Vector Multiplication (SpMV) is very common to scientific computing. ...
This paper presents an innovative performance-model driven approach for partitioning sparse matrix into appropriate formats, and auto-tuning configurations of CUDA kernels to improve the performance of ...
For CSR vector kernel, one warp (32 threads) is responsible for computing the multiplication of one row of the matrix and the vector. ...
doi:10.1145/2016741.2016744
fatcat:yukl7bdijjfc7gqlrx5sv2k7wy
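The warp-per-row CSR "vector" kernel described above can be mimicked serially: 32 lanes each take every 32nd nonzero of a row, and the per-lane partial sums are reduced, which is what one warp does on the GPU. This is a plain-Python sketch, not CUDA:

```python
WARP_SIZE = 32  # lanes in one NVIDIA warp

def csr_vector_row(row_vals, row_cols, x):
    """One row of SpMV, organized the way a CSR vector kernel does it:
    lane j of the warp handles nonzeros j, j+32, j+64, ..., and the
    per-lane partial sums are then reduced to a single value."""
    partial = [0.0] * WARP_SIZE
    for lane in range(WARP_SIZE):
        for k in range(lane, len(row_vals), WARP_SIZE):
            partial[lane] += row_vals[k] * x[row_cols[k]]
    return sum(partial)  # on the GPU: a warp-level shuffle reduction

# Row with 5 nonzeros at columns 0..4, multiplied by the all-ones vector:
print(csr_vector_row([1.0, 2.0, 3.0, 4.0, 5.0], [0, 1, 2, 3, 4],
                     [1.0] * 5))  # → 15.0
```

The stride-32 access means consecutive lanes read consecutive nonzeros, giving coalesced loads; rows much shorter than 32 leave most lanes idle, which is why such frameworks tune the kernel choice per matrix.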
Performance Enhancement Strategies for Sparse Matrix-Vector Multiplication (SpMV) and Iterative Linear Solvers
[article]
2022
arXiv
pre-print
The crucial computing kernel for such iterative solutions is the multiplication of a sparse matrix by a dense vector. ...
Efficient implementations of sparse matrix-vector multiplication (SpMV) and linear solvers are therefore essential and have been the subject of extensive research across a variety of computing architectures ...
Acknowledgement The authors acknowledge with thanks the technical and financial support from the Deanship of Scientific Research (DSR) at the King Abdulaziz University (KAU), Jeddah, Saudi Arabia, under ...
arXiv:2212.07490v1
fatcat:3jtf7lhjtbgdhnozrgx3k27wiu
Implementing Sparse Matrix-Vector Multiplication with QCSR on GPU
2013
Applied Mathematics & Information Sciences
Various data formats to store the sparse matrix have been implemented on GPUs to maximize the performance. ...
However, it is still challenging to optimize computations with irregular data access patterns like sparse matrix-vector multiplication (SPMV). ...
Some of the first work on sparse matrix-vector multiplication on GPU architectures was by Bolz et al [26] . ...
doi:10.12785/amis/070207
fatcat:okhgxfolzjbp5mb4tevbzf25bq
GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems
2016
International journal of parallel programming
The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems ...
GPUs, and the Intel Xeon Phi. ...
Special thanks go to Andreas Alvermann for providing sparse matrix generation functions for testing and everyone else who contributed to GHOST, directly or indirectly. ...
doi:10.1007/s10766-016-0464-z
fatcat:w64ypr4otfgcxbfftzzuko5nim
Exact sparse matrix-vector multiplication on GPU's and multicore architectures
2010
Proceedings of the 4th International Workshop on Parallel and Symbolic Computation - PASCO '10
We propose different implementations of the sparse matrix--dense vector multiplication (SpMV) for finite fields and rings Z/mZ. ...
We take advantage of graphic card processors (GPU) and multi-core architectures. Our aim is to improve the speed of SpMV in the library, and henceforth the speed of its black box algorithms. ...
On the numerical side, several libraries automatically tune the sparse matrix kernels [19, 20, 16] and recently some kernels have been proposed e.g. for GPUs [17, 2, 1] . ...
doi:10.1145/1837210.1837224
dblp:conf/cap/BoyerDG10
fatcat:kpbem74jmre2lkrfutypof7lxi
The challenges of writing portable, correct and high performance libraries for GPUs
2011
SIGARCH Computer Architecture News
Specifically we target the linear solver module, including Conjugate Gradient, Jacobi and MinRes solvers for sparse matrices. ...
This problem requires a different set of tradeoffs than finding the best runtime for a single solution. ...
We thank SCI Institute personnel Jeroen Stinstra, Darrell Swenson (for guidance and test problems), and especially Ayla Khan (for help integrating our work with the SCI tools). ...
doi:10.1145/2082156.2082158
pmid:23807820
pmcid:PMC3691863
fatcat:lk22rfzsv5cozny52kuspmldqm
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
[chapter]
2014
Lecture Notes in Computer Science
The computation, known as the sparse-matrix vector multiplication (SpMV), and with some variants, such as the sparse-matrix matrix multiplication (SpMM), they form the computational core of many applications ...
The core of many scientific applications involves the multiplication of a large, sparse matrix with a single or multiple dense vectors which are not compute-bound but memory-bound. ...
An m×n sparse matrix A with τ nonzeros is usually stored in the compressed row storage format (CRS), which uses three arrays: cids[.], an integer array of size τ that stores the column ids for each nonzero ...
doi:10.1007/978-3-642-55224-3_52
fatcat:vwfpsy3ur5ahznfdoicggp26dq
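The CRS arrays described in this abstract can be built from coordinate triples in a few lines. Only `cids` is named in the snippet; `vals` and `rptr` are assumed names for the other two arrays:

```python
def coo_to_crs(triples, n_rows):
    """Build CRS arrays from (row, col, value) triples.

    Returns (rptr, cids, vals): cids and vals both have length tau
    (the nonzero count), and rptr has length n_rows + 1 and delimits
    each row's slice of the other two arrays.
    """
    triples = sorted(triples)            # row-major order
    cids = [c for _, c, _ in triples]
    vals = [v for _, _, v in triples]
    rptr = [0] * (n_rows + 1)
    for r, _, _ in triples:              # count nonzeros per row...
        rptr[r + 1] += 1
    for i in range(n_rows):              # ...then prefix-sum into offsets
        rptr[i + 1] += rptr[i]
    return rptr, cids, vals

triples = [(0, 0, 1.0), (0, 2, 2.0), (2, 1, 3.0)]
print(coo_to_crs(triples, 3))  # → ([0, 2, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
```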
RT-CUDA: A Software Tool for CUDA Code Restructuring
2016
International journal of parallel programming
A blocked sparse matrix-vector multiplication for NVIDIA GPUs [10] has been implemented. ...
An optimized version of Sparse Matrix Vector (SpMV) Multiplication [27] has been implemented on NVIDIA GPUs using CUDA. ...
The modification to the code has no effect on the parameters for the Jacobi solver ...
doi:10.1007/s10766-016-0433-6
fatcat:xxikpkyrkvdizgijskmk2qxkay
Software for Sparse Tensor Decomposition on Emerging Computing Architectures
[article]
2019
arXiv
pre-print
In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures. ...
Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. ...
We thank the anonymous referees for critical feedback on this work, resulting in substantial improvements in the presentation. ...
arXiv:1809.09175v2
fatcat:eflkr3ocmjgtrfizp6gn4jti6y
Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
2015
2015 IEEE International Parallel and Distributed Processing Symposium
To alleviate the effects of scattered data access we combine loosely coupled outer iterations with tightly coupled block sparse matrix multiple vector operations, which enables pure data streaming. ...
At the node level we show that it is possible to decouple the sparse matrix problem posed by KPM from main memory bandwidth both on CPU and GPU. ...
ACKNOWLEDGMENTS We are indebted to the Swiss National Computing Centre for granting access to Piz Daint. ...
doi:10.1109/ipdps.2015.76
dblp:conf/ipps/KreutzerPHWAF15
fatcat:afbnmvlqlrgb5iia35rwupxmma
DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions
[chapter]
2020
Lecture Notes in Computer Science
This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA's graphics processing units (GPUs). ...
The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation for matrix multiplication. ...
This research was partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286 and MEXT as "Exploratory Issue on Post-K computer" (Development of verified ...
doi:10.1007/978-3-030-50743-5_12
fatcat:wanjoziydvf4hlmz4gjk7om3aq
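The error-free transformation that the Ozaki scheme builds on can be sketched with classic Veltkamp/Dekker splitting; this is the textbook formulation, not the paper's Tensor-Core implementation:

```python
from fractions import Fraction

def split(a):
    """Veltkamp splitting: a == hi + lo, each half fitting in ~26 bits."""
    c = 134217729.0 * a        # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Error-free transformation of a product: returns (p, e) with
    a*b == p + e exactly (barring overflow/underflow), where p is
    the rounded product and e is the rounding error."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

p, e = two_prod(0.1, 0.3)
# p + e reproduces the exact real product of the two floats:
assert Fraction(p) + Fraction(e) == Fraction(0.1) * Fraction(0.3)
```

Splitting each operand into halves short enough that their pairwise products are exact is the same idea the Ozaki scheme applies to whole matrices, so that lower-precision multiplies can be summed back into a high-accuracy result.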
Kokkos Kernels: Performance Portable Sparse/Dense Linear Algebra and Graph Kernels
[article]
2021
arXiv
pre-print
We describe Kokkos Kernels, a library of kernels for sparse linear algebra, dense linear algebra and graph kernels. ...
Specifically, we demonstrate the performance of four sparse kernels, three dense batched kernels, two graph kernels and one team level algorithm. ...
ACKNOWLEDGMENTS We would like to acknowledge the contributions of several "alumni" and friends of the Kokkos Kernels library ...
arXiv:2103.11991v1
fatcat:m7iskgt5kjdjnex7lenqsjj6z4
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA
[article]
2017
arXiv
pre-print
Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. ...
a full LSTM for speech recognition with a power dissipation of 41 Watts. ...
This work was supported by National Natural Science Foundation of China (No.61373026, 61622403, 61261160501). ...
arXiv:1612.00694v2
fatcat:a65wy2piqnezjjjmauub7rdlai
Showing results 1 — 15 out of 417 results