Leveraging One-Sided Communication for Sparse Triangular Solvers.

Hybrid versions of two iterative linear solver strategies are presented: one takes advantage of block triangular form structure, while the other uses a Schur complement technique. ... Results indicate up to a 27x improvement in total simulation time on 256 cores. ... The triangular solve in this particular case has multiple right-hand sides, where the right-hand sides are themselves sparse columns of C. ...

doi:10.1007/978-3-319-17353-5_9 fatcat:5jrocvi2b5cltcfgjwkc6gs6by

This is particularly the case for the Sparse Triangular Solver (SpTRSV), which introduces additional two-dimensional computation dependencies among subsequent computation steps. ... Dependency information is exchanged and shared among GPUs, thus warranting efficient memory allocation, data partitioning, and workload distribution as well as fine-grained communication and synchronization ... heap and relying on the one-sided communication primitives in NVSHMEM for inter-GPU communication. ...

arXiv:2012.06959v1 fatcat:am7guw7i5fchxafrkp34plwvky

Sparse solvers provide essential functionality for a wide variety of scientific applications. ... Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. ... The second technique leverages the one-sided MPI communication functions to implement a synchronization-free task queue, allowing more overlap of communication and computation, leading to additional 2× ...

doi:10.1098/rsta.2019.0053 pmid:31955673 fatcat:bqw6xqixbrabddmxglmtcbw2wa

Citation

Hartwig Anzt, Erik Boman, Rob Falgout, Pieter Ghysels, Michael Heroux, Xiaoye Li, Lois Curfman McInnes, Richard Tran Mills, Sivasankaran Rajamanickam, Karl Rupp, Barry Smith, Ichitaro Yamazaki, Ulrike Meier Yang. "Preparing sparse solvers for exascale computing." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 378.2166 (2020) 20190053

In particular, we study ILUPACK, a package for the solution of sparse linear systems via Krylov subspace methods that relies on a modern inverse-based multilevel ILU (incomplete LU) preconditioning technique ... We present new data-parallel versions of the preconditioner and the most important solvers contained in the package that significantly improve its performance without affecting its accuracy. ... [34] proposed a new GPU solver for sparse triangular systems, for matrices stored in the CSC format, based on the self-scheduled strategy. ...

doi:10.19153/cleiej.24.1.6 doaj:cf900516b6334e27afbe4102fa203079 fatcat:ohhmcgyrl5hgfaib7hb2rdyhom

Sparse triangular solver is one such kernel and is the focus of this paper. ... As a result, on a 12-core Intel Xeon processor, our approach improves the performance of sparse triangular solver by 1.6x, compared to the conventional level-scheduling with barrier synchronization ... Heroux for his insights related to the implementation and performance of conjugate gradient, and Kiran Pamnany for sharing his dissemination barrier implementation. Bibliography ...

doi:10.1007/978-3-319-07518-1_8 fatcat:z3lzntgb6bh27i3lo2lzmojox4
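
The snippet above names conventional level scheduling as its baseline. As a rough illustration of that baseline only (not of the paper's own method), the sketch below groups the rows of a lower-triangular matrix into levels, so that rows in one level depend only on rows in earlier levels and could be solved in parallel with a barrier between levels. The function name `level_sets_csr` and the CSR array names are assumptions made for this example.

```python
def level_sets_csr(n, row_ptr, col_idx):
    """Group rows of an n x n lower-triangular CSR pattern into levels.

    Hypothetical helper for illustration. Row i depends on row j whenever
    L[i, j] is stored with j < i. A row's level is one more than the
    highest level among the rows it depends on.
    """
    level = [0] * n
    for i in range(n):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[p]
            if j < i:  # off-diagonal entry: x[i] needs x[j] first
                level[i] = max(level[i], level[j] + 1)

    num_levels = max(level) + 1 if n else 0
    sets = [[] for _ in range(num_levels)]
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets


# Sparsity pattern (rows -> stored columns):
#   row 0: {0}, row 1: {1}, row 2: {0, 2}, row 3: {1, 2, 3}
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
levels = level_sets_csr(4, row_ptr, col_idx)
print(levels)  # [[0, 1], [2], [3]]
```

Rows 0 and 1 have no dependencies and form the first level; a barrier-based solver would synchronize once after each of the three levels.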

When the sparse matrix is symmetric and positive definite, direct methods use Cholesky factorization while iterative methods rely on Conjugate Gradients. ... Our goal is to develop a scalable and memory-efficient hybrid of the two methods that can be implemented with high efficiency on both serial and parallel computers and be suitable for a wide range of problems ... The key is to leverage technology that has been developed for sparse direct methods. ...

doi:10.1002/(sici)1096-9128(200002/03)12:2/3<53::aid-cpe473>3.3.co;2-2 fatcat:7bqxv5ygpbbvneb6yfxshyrske

sparse linear systems. ... We present a set of new batched CUDA kernels for the LU factorization of a large collection of independent problems of different size, and the subsequent triangular solves. ... linear solvers" (d65). ...

doi:10.1109/icpp.2017.18 dblp:conf/icpp/AnztDFQ17 fatcat:sryw4eagnnf3zbuzllaxhm3ebm

When the sparse matrix is symmetric and positive definite, direct methods use Cholesky factorization while iterative methods rely on Conjugate Gradients. ... Our goal is to develop a scalable and memory-efficient hybrid of the two methods that can be implemented with high efficiency on both serial and parallel computers and be suitable for a wide range of problems ... The key is to leverage technology that has been developed for sparse direct methods. ...

doi:10.1002/(sici)1096-9128(200002/03)12:2/3<53::aid-cpe473>3.0.co;2-b fatcat:kv7m4mb5lnh4jgigz6ygpmrkqy

The performance of the massively parallel direct multifrontal solver Watson Sparse Matrix Package (WSMP) for solving large sparse systems of linear equations arising in implicit finite element method on ... unstructured (free) meshes in solid mechanics was evaluated on one of the most powerful supercomputers currently available to the open science community: the sustained petascale high performance computing ... Acknowledgments The authors would like to thank the Private Sector Program and the Blue Waters sustained-petascale computing project at the National Center for Supercomputing Applications (NCSA). ...

doi:10.1016/j.cma.2016.01.011 fatcat:7fvypjoinrcmnigv3hldkqfcxe

For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 16 nodes, with 16 cores each. ... We target the parallel solution of sparse linear systems via iterative Krylov subspace-based methods enhanced with ILU-type preconditioners on clusters of multicore processors. ... On the positive side, it diminishes the amount of messages (though not the total volume of communication), and it does not change the numerical properties of the solver. ...

doi:10.1002/cpe.4280 fatcat:m7hsdjutjnhfhmxxl4q5izo6ka

Our experiments with an nVidia S2070 GPU report speed-ups up to 6× for the hybrid band solver based on the LU factorization over analogous CPU-only routines in Intel's MKL. ... As a practical demonstration of these benefits, we plug the new CPU-GPU codes into a sparse matrix Lyapunov equation solver, showing a 3× acceleration on the solution of a large-scale benchmark arising ... The advantages of the hybrid band routines carry over to the solution of sparse Lyapunov solvers, with an acceleration factor around 2-3× with respect to the analogous solver based on MKL. ...

doi:10.1007/978-3-319-09153-2_29 fatcat:vf37puekijcphjxs6vqmmw7wim

In this paper, we investigate the use of an asynchronous task paradigm, one-sided communication and dynamic scheduling in implementing sparse Cholesky factorization (symPACK) on large-scale distributed ... Our solver symPACK relies on efficient and flexible communication primitives provided by the UPC++ library. ... Another very important characteristic of communication protocols is whether a communication primitive is two-sided or one-sided. ...

arXiv:1608.00044v2 fatcat:wfxqlgser5e2rmgwxjcum23m6i

The computation patterns in sparse numerical methods are guided by the input sparsity structure and the sparse algorithm itself. ... Sympiler is a domain-specific code generator that optimizes sparse matrix computations by decoupling the symbolic analysis phase from the numerical manipulation stage in sparse codes. ... Motivating Scenario: Sparse triangular solve takes a lower triangular matrix L and a right-hand side (RHS) vector b and solves the linear equation Lx = b for x. ...

doi:10.1145/3126908.3126936 dblp:conf/sc/CheshmiKSD17 fatcat:joe4jxi2lraelbjwo65l3sarpa
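
To make the definition of the sparse triangular solve in the snippet above concrete, here is a minimal sketch of column-oriented forward substitution for Lx = b. It is a plain reference loop, not Sympiler's generated code. The function name `sptrsv_lower_csc`, the CSC array names, and the assumption that each column stores its diagonal entry first are all choices made for this example.

```python
def sptrsv_lower_csc(n, col_ptr, row_idx, vals, b):
    """Solve L x = b for a lower-triangular L stored in CSC form.

    Hypothetical helper for illustration. Assumes row indices within each
    column are sorted, so the diagonal entry is the first one stored.
    """
    x = list(b)
    for j in range(n):
        start, end = col_ptr[j], col_ptr[j + 1]
        if start == end or row_idx[start] != j:
            raise ValueError(f"missing diagonal entry in column {j}")
        x[j] /= vals[start]  # x[j] is now final
        # Subtract column j's contribution from the remaining unknowns.
        for p in range(start + 1, end):
            x[row_idx[p]] -= vals[p] * x[j]
    return x


# L = [[2, 0, 0],
#      [1, 3, 0],
#      [0, 4, 5]],  b = [2, 7, 23]
col_ptr = [0, 2, 4, 5]
row_idx = [0, 1, 1, 2, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
print(sptrsv_lower_csc(3, col_ptr, row_idx, vals, [2.0, 7.0, 23.0]))
# [1.0, 2.0, 3.0]
```

The loop over j is inherently sequential in this form, because column j cannot be processed until x[j] is final. That dependency is what the solvers listed here try to work around.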

To leverage significant software development effort, general purpose unstructured codes are often used in structured or semi-structured applications. ... We show that O(n log n) computational complexities, competitive with classic Fourier methods, are achievable for some classes of semi-structured spectral element applications. ... For this application, we use SI-2, the SI scheme with a two-way data mapping [12] implemented in our solver package [9]. ...

doi:10.1016/b978-008044046-0.50500-5 fatcat:ymhmr2vxujdktmi4dsu3xbtq3y

improve the energy performance of sparse linear system solvers, without negatively impacting their performance. ... One contribution of 14 to a Theme Issue 'Stochastic modelling and energy-efficient computing for weather and climate prediction'. ... (c) Leveraging the CPU states on manycore systems: The results in §3b illustrate that GPUs are among the most energy-efficient hardware architectures for sparse linear algebra. ...

doi:10.1098/rsta.2013.0279 pmid:24842036 fatcat:kw7cnmvzrff6pmihhqenl53uwm

A Hybrid Approach for Parallel Transistor-Level Full-Chip Circuit Simulation [chapter]

Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures [article]

Preparing sparse solvers for exascale computing

Accelerating advanced preconditioning methods on hybrid architectures

Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver [chapter]

Towards a scalable hybrid sparse solver

Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning

Towards a scalable hybrid sparse solver

Sparse matrix factorization in the implicit finite element method on petascale architecture

Communication in task-parallel ILU-preconditioned CG solvers using MPI + OmpSs

Accelerating Band Linear Algebra Operations on GPUs with Application in Model Reduction [chapter]

An Asynchronous Task-based Fan-Both Sparse Cholesky Solver [article]

Sympiler

An O(n log n) solution algorithm for spectral element methods [chapter]

Improving the energy efficiency of sparse linear system solvers on multicore and manycore systems
