Compiler Algorithms for Optimizing Locality and Parallelism on Shared and Distributed-Memory Machines
2000
Journal of Parallel and Distributed Computing
Recently, scalable machines based on logically shared physically distributed memory have been designed and implemented. ...
In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. ...
Ramanujam is supported in part by an NSF Young Investigator Award CCR-9457768 and by NSF Grant CCR-9210422. ...
doi:10.1006/jpdc.2000.1639
fatcat:ie4kkynfmbeezedi2b4z6t6esm
Compiler algorithms for optimizing locality and parallelism on shared and distributed memory machines
Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques
Recently, scalable machines based on logically shared physically distributed memory have been designed and implemented. ...
In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. ...
Ramanujam is supported in part by an NSF Young Investigator Award CCR-9457768 and by NSF Grant CCR-9210422. ...
doi:10.1109/pact.1997.644019
dblp:conf/IEEEpact/KandemirRC97
fatcat:gpco4hn4kvbc3ez3zj4uczmudu
Author retrospective for optimizing for parallelism and data locality
2014
25th Anniversary International Conference on Supercomputing Anniversary Volume
This work (from my PhD dissertation, supervised by Ken Kennedy) was one of the early papers to optimize for and experimentally explore the tension between data locality and parallelism on shared memory ...
Today there is an urgent need for algorithms, programming language systems and tools, and hardware that deliver on the potential of parallelism due to the end of Dennard scaling. ...
I thank the anonymous reviewer for improving the first version and Steve Blackburn, Mary Hall, and Todd Mytkowicz for discussions and comments on this version. ...
doi:10.1145/2591635.2591646
dblp:conf/ics/McKinley14
fatcat:dw7ex6lsefbozlajahijmxuxau
Productivity and performance using partitioned global address space languages
2007
Proceedings of the 2007 international workshop on Parallel symbolic computation - PASCO '07
The result is portable high-performance compilers that run on a large variety of shared and distributed memory multiprocessors. ...
Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. ...
threads and their associated data to processors: on a distributed memory machine, the local memory of a processor holds both the thread's private data and the shared data with affinity to that thread. ...
doi:10.1145/1278177.1278183
dblp:conf/issac/YelickBCCDDGHHHIKNSWW07
fatcat:hpedjb24vvfkbpi7fbawt6xf4u
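The affinity notion described in the entry above is easy to make concrete. Below is a minimal UPC sketch (UPC being one of the PGAS languages the paper covers); the array name, size, and reduction are invented for illustration, not taken from the paper. By default a shared array is distributed cyclically, so element a[i] has affinity to thread i % THREADS, and the upc_forall affinity expression runs each iteration on the thread whose local memory holds that element.

#include <upc.h>
#include <stdio.h>

#define N 1024

shared int a[N];           /* cyclic layout: a[i] lives with thread i % THREADS */

int main(void) {
    int i, local_sum = 0;  /* private data: one copy per thread */

    /* The fourth clause is the affinity expression: iteration i runs
       on the thread that owns &a[i], so every access below is local. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = i;

    upc_barrier;

    upc_forall (i = 0; i < N; i++; &a[i])
        local_sum += a[i];

    printf("thread %d of %d: partial sum %d\n", MYTHREAD, THREADS, local_sum);
    return 0;
}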
Optimizing UPC Programs for Multi-Core Systems
2010
Scientific Programming
The Partitioned Global Address Space (PGAS) model of Unified Parallel C (UPC) can help users express and manage application data locality on non-uniform memory access (NUMA) multi-core shared-memory systems ...
Second, we use two numerical computing kernels, parallel matrix–matrix multiplication and parallel 3-D FFT, to demonstrate the end-to-end development and optimization for UPC applications. ...
For example, OpenMP provides compiler directives to easily parallelize for loops but the speedups may be dismal if the data distribution and access locality are not optimized accordingly. ...
doi:10.1155/2010/646829
fatcat:q63ngpj47jblhfzbfcdehsmuyi
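The OpenMP caveat in the snippet above (parallel loops whose data placement is not optimized) is exactly what UPC's layout qualifiers address. A hedged sketch follows, with invented names rather than the paper's kernels, assuming UPC's static THREADS compilation environment: blocking the arrays gives each thread one contiguous chunk, and an affinity expression that matches the layout keeps every access NUMA-local.

#include <upc.h>

#define N 4096

/* [*] gives a blocked layout, one contiguous chunk per thread, so on
   a NUMA machine each chunk sits in the memory nearest its owner. */
shared [*] double x[N], y[N];

void axpy(double alpha) {
    int i;
    /* Affinity &x[i] matches the blocked layout: each thread touches
       only the block it owns, so no remote or far-NUMA accesses. */
    upc_forall (i = 0; i < N; i++; &x[i])
        y[i] += alpha * x[i];
}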
Evaluating automatic parallelization for efficient execution on shared-memory multiprocessors
1994
Proceedings of the 8th international conference on Supercomputing - ICS '94
With this metric, our algorithm improves or matches hand-coded parallel programs on shared-memory, bus-based parallel machines for eight of the nine programs in our test suite. ...
The algorithm optimizes for data locality and parallelism, reducing or eliminating false sharing. It also uses interprocedural analysis and transformations to improve the granularity of parallelism. ...
I thank Mary Hall, Chau-Wen Tseng, Preston Briggs, Paul Havlak, and Nat McIntosh for their comments and support throughout the development of this work. ...
doi:10.1145/181181.181265
dblp:conf/ics/McKinley94
fatcat:4eulalgo3rc5naym2lhb2wjl6i
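False sharing, which the snippet above says the algorithm reduces or eliminates, occurs when threads write distinct variables that happen to share a cache line, so the line ping-pongs between caches. The paper's compiler transforms Fortran automatically; the C/OpenMP sketch below, with invented names, only shows the manual analogue of one such fix (padding per-thread data to cache-line boundaries).

#include <omp.h>

#define NTHREADS   8
#define CACHE_LINE 64

/* An unpadded long counts[NTHREADS] would put several per-thread
   counters on one cache line; every increment would then invalidate
   that line in the other processors' caches. Padding each counter
   out to a full line eliminates the false sharing. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counts[NTHREADS];

void count_evens(const int *data, int n) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            if (data[i] % 2 == 0)
                counts[t].value++;
    }
}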
Hierarchical Models and Software Tools for Parallel Programming
[chapter]
2003
Lecture Notes in Computer Science
We thus need well-founded models and efficient new tools for hierarchical parallel machines, in order to connect algorithm design and complexity results to high-performance program implementation. ...
The similarity among the issues of managing memory hierarchies and those of parallel computation has been pointed out before (see for instance [213]). ...
Acknowledgments We wish to thank all the participants to the GI-Dagstuhl-Forschungsseminar "Algorithms for Memory Hierarchies". ...
doi:10.1007/3-540-36574-5_15
fatcat:mgkm5xi34vanfjx2fkryqfoq4i
An advanced compiler framework for non-cache-coherent multiprocessors
2002
IEEE Transactions on Parallel and Distributed Systems
From our experiments, we learned that our compiler performs well for a variety of applications on the T3D and T3E and we found a few sophisticated techniques that could improve performance even more once ...
In this paper, we present our experience with a compiler framework for automatic parallelization and communication generation that has the potential to reduce the time-consuming hand-tuning that would ...
without into one-sided shared-memory codes for execution on NCC machines and analyzing their performance. ...
doi:10.1109/71.993205
fatcat:m6ts7o7jxvg5hbb23pjayd45a4
A compiler optimization algorithm for shared-memory multiprocessors
1998
IEEE Transactions on Parallel and Distributed Systems
We compare the original parallel program to the hand-optimized program, and show that our algorithm improves 3 programs, matches 4 programs, and degrades 1 program in our test suite on a shared-memory, ...
optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines. ...
In this paper, we consider optimizing Fortran programs for symmetric shared-memory, bus-based parallel machines with local caches. ...
doi:10.1109/71.706049
fatcat:3m5odkybzvgm3putgvlki3aznu
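One representative transformation behind results like those above is loop permutation for cache locality. The sketch below is a generic C analogue, not the paper's algorithm; note that the paper targets Fortran, whose column-major layout reverses which loop order is preferred.

#define N 1024
static double a[N][N], b[N][N];

/* Poor order for C's row-major arrays: the inner loop strides by
   N elements, touching a new cache line on almost every access. */
void scale_colmajor(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 2.0 * b[i][j];
}

/* After interchange the inner loop is unit-stride, so each cache
   line is fully used before it is evicted. */
void scale_rowmajor(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 2.0 * b[i][j];
}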
Modern Fortran as the Port of Entry for Scientific Parallel Computing
1996
Computing in High Energy Physics '95
This paper describes Fortran 90 and the standardized language extensions for both shared-memory and distributed-memory parallelism. ...
Several case studies are examined showing how the distributed-memory extensions (High Performance Fortran) are used both for data parallel and MIMD (multiple instruction, multiple data) algorithms. ...
to achieve parallelism locally on each of the SMP machines. ...
doi:10.1142/9789814447188_0110
fatcat:zgetmnukpnd5ppcwtnmwm6t3ia
Towards a Complexity Model for Design and Analysis of PGAS-Based Algorithms
[chapter]
2007
Lecture Notes in Computer Science
The experimental results shed further light on the impact of data distributions on locality and performance and confirm the accuracy of the complexity model as a useful tool for the design and analysis ...
PGAS programming languages provide ease-of-use through a global shared address space while emphasizing performance by providing locality awareness and a partition of the address space. ...
The most popular is the Parallel Random Access Machine (PRAM) model, which is used for both shared memory and network-based systems [10] . ...
doi:10.1007/978-3-540-75444-2_63
fatcat:vjijb5ktrngfvl5vbbv77kuuci
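A generic cost expression in the spirit of such a model (not the paper's exact formulation) separates local from remote accesses, since a PGAS data distribution changes only how memory operations split between the two:

\[ T \;\approx\; T_{\mathrm{comp}} \;+\; N_{\mathrm{local}}\, t_{\mathrm{local}} \;+\; N_{\mathrm{remote}}\, t_{\mathrm{remote}}, \qquad t_{\mathrm{remote}} \gg t_{\mathrm{local}} \]

Because t_remote dominates, a distribution that converts remote accesses into local ones reduces T almost linearly in N_remote, which matches the observed impact of data distribution on performance quoted above.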
ScaleUPC
2009
Proceedings of the Third Conference on Partitioned Global Address Space Programing Models - PGAS '09
model makes it a natural choice for a single multi-core machine, where the main memory is physically shared. ...
As the communication cost for remote accesses is removed because all accesses are physically local in a multi-core, we find that the overhead of pointer arithmetic on shared data accesses becomes a prominent ...
Acknowledgements We would like to thank the anonymous reviewers for their useful comments on this paper. This work is supported by NSF Career CCF-0643664, NSF CCF-0811427, and NSF CCF-0833082. ...
doi:10.1145/1809961.1809976
fatcat:g5q23kqffredjeb7rexpuicix4
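The pointer-arithmetic overhead ScaleUPC identifies comes from shared pointers carrying thread and phase bookkeeping even when the target is physically local. A common manual counterpart of such compiler optimizations is privatization: casting a thread's own block to an ordinary C pointer. A minimal sketch with invented names follows; the cast is legal in UPC only because the block has affinity to the calling thread.

#include <upc.h>

#define BLOCK 1024

shared [BLOCK] double v[BLOCK * THREADS];   /* one block per thread */

double local_sum(void) {
    /* Dropping the shared qualifier lets the loop below use plain C
       pointer arithmetic instead of shared-address decoding. */
    double *mine = (double *)&v[MYTHREAD * BLOCK];
    double s = 0.0;
    for (int i = 0; i < BLOCK; i++)
        s += mine[i];
    return s;
}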
Scalable Dynamic Load Balancing Using UPC
2008
2008 37th International Conference on Parallel Processing
Our implementation achieves better scaling and parallel efficiency in both shared memory and distributed memory settings than previous efforts using UPC [1] and MPI [2]. ...
However, to obtain performance portability with UPC in both shared memory and distributed memory settings requires the careful use of one-sided reads and writes to minimize the impact of high-latency communication ...
Acknowledgment The authors thank the Renaissance Computing Institute for the use of the Kitty Hawk cluster and the University of North Carolina for the use of the Topsail cluster and the SGI Altix. ...
doi:10.1109/icpp.2008.19
dblp:conf/icpp/OlivierP08
fatcat:wgivv2ozofgjlm6fvlgkuxrqvm
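The "careful use of one-sided reads and writes" mentioned above typically means bulk transfers in place of fine-grained remote accesses. Below is a hedged sketch of that pattern with an invented layout; the paper's actual work-stealing data structures and synchronization are more involved.

#include <upc.h>

#define CHUNK 64

shared [CHUNK] int queues[CHUNK * THREADS];   /* one work chunk per thread */

/* A single one-sided bulk read fetches a victim's whole chunk in one
   transaction, instead of CHUNK individual remote reads each paying
   the full network latency. Synchronization with the victim (e.g. a
   upc_lock around the chunk) is omitted for brevity. */
void steal(int victim, int *local_buf) {
    upc_memget(local_buf, &queues[victim * CHUNK], CHUNK * sizeof(int));
}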
Combining Static and Dynamic Data Coalescing in Unified Parallel C
2016
IEEE Transactions on Parallel and Distributed Systems
Significant progress has been made in the development of programming languages and tools that are suitable for hybrid computer architectures that group several shared-memory multicores interconnected through ...
A performance evaluation reports both scaling and absolute performance numbers on up to 32768 cores of a Power 775 supercomputer. ...
PGAS programming languages use the same programming model for local, shared and distributed memory hardware. ...
doi:10.1109/tpds.2015.2405551
fatcat:isr4fuw6nvfpzfo4abngauwame
NUMA Computing with Hardware and Software Co-Support on Configurable Emulated Shared Memory Architectures
2014
International Journal of Networking and Computing
The hardware techniques include three different NUMA shared memory access mechanisms and the software ones provide a mechanism to integrate and optimize NUMA computation into the standard parallel random ...
The emulated shared memory (ESM) architectures are good candidates for future general purpose parallel computers due to their ability to provide an easy-to-use explicitly parallel synchronous model of ...
Among the architectural approaches for parallel and multicore computing that make use of memory, whether distributed on-chip or among a number of chips, there are very few that support ...
doi:10.15803/ijnc.4.1_189
fatcat:hmoejoyocncfhme3urknmvopku