Compiler Algorithms for Optimizing Locality and Parallelism on Shared and Distributed-Memory Machines
2000
Journal of Parallel and Distributed Computing
Recently, scalable machines based on logically shared physically distributed memory have been designed and implemented. ...
In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. ...
Ramanujam is supported in part by an NSF Young Investigator Award CCR-9457768 and by NSF Grant CCR-9210422. ...
doi:10.1006/jpdc.2000.1639
fatcat:ie4kkynfmbeezedi2b4z6t6esm
Compiler algorithms for optimizing locality and parallelism on shared and distributed memory machines
Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques
Recently, scalable machines based on logically shared physically distributed memory have been designed and implemented. ...
In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. ...
Ramanujam is supported in part by an NSF Young Investigator Award CCR-9457768 and by NSF Grant CCR-9210422. ...
doi:10.1109/pact.1997.644019
dblp:conf/IEEEpact/KandemirRC97
fatcat:gpco4hn4kvbc3ez3zj4uczmudu
Author retrospective for optimizing for parallelism and data locality
2014
25th Anniversary International Conference on Supercomputing Anniversary Volume
This work (from my PhD dissertation, supervised by Ken Kennedy) was one of the early papers to optimize for and experimentally explore the tension between data locality and parallelism on shared memory ...
Today there is an urgent need for algorithms, programming language systems and tools, and hardware that deliver on the potential of parallelism due to the end of Dennard scaling. ...
I thank the anonymous reviewer for improving the first version and Steve Blackburn, Mary Hall, and Todd Mytkowicz for discussions and comments on this version. ...
doi:10.1145/2591635.2591646
dblp:conf/ics/McKinley14
fatcat:dw7ex6lsefbozlajahijmxuxau
Productivity and performance using partitioned global address space languages
2007
Proceedings of the 2007 international workshop on Parallel symbolic computation - PASCO '07
The result is portable high-performance compilers that run on a large variety of shared and distributed memory multiprocessors. ...
Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. ...
threads and their associated data to processors: on a distributed memory machine, the local memory of a processor holds both the thread's private data and the shared data with affinity to that thread. ...
doi:10.1145/1278177.1278183
dblp:conf/issac/YelickBCCDDGHHHIKNSWW07
fatcat:hpedjb24vvfkbpi7fbawt6xf4u
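The affinity notion described in the entry above is easy to make concrete. Below is a minimal UPC sketch (UPC being one of the PGAS languages the paper covers); the array name, size, and reduction are invented for illustration, not taken from the paper. By default a shared array is distributed cyclically, so element a[i] has affinity to thread i % THREADS, and the upc_forall affinity expression runs each iteration on the thread whose local memory holds that element.

#include <upc.h>
#include <stdio.h>

#define N 1024

shared int a[N];           /* cyclic layout: a[i] lives with thread i % THREADS */

int main(void) {
    int i, local_sum = 0;  /* private data: one copy per thread */

    /* The fourth clause is the affinity expression: iteration i runs
       on the thread that owns &a[i], so every access below is local. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = i;

    upc_barrier;

    upc_forall (i = 0; i < N; i++; &a[i])
        local_sum += a[i];

    printf("thread %d of %d: partial sum %d\n", MYTHREAD, THREADS, local_sum);
    return 0;
}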
Optimizing UPC Programs for Multi-Core Systems
2010
Scientific Programming
The Partitioned Global Address Space (PGAS) model of Unified Parallel C (UPC) can help users express and manage application data locality on non-uniform memory access (NUMA) multi-core shared-memory systems ...
Second, we use two numerical computing kernels, parallel matrix–matrix multiplication and parallel 3-D FFT, to demonstrate the end-to-end development and optimization for UPC applications. ...
For example, OpenMP provides compiler directives to easily parallelize for loops but the speedups may be dismal if the data distribution and access locality are not optimized accordingly. ...
doi:10.1155/2010/646829
fatcat:q63ngpj47jblhfzbfcdehsmuyi
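The OpenMP caveat in the snippet above (parallel loops whose data placement is not optimized) is exactly what UPC's layout qualifiers address. A hedged sketch follows, with invented names rather than the paper's kernels, assuming UPC's static THREADS compilation environment: blocking the arrays gives each thread one contiguous chunk, and an affinity expression that matches the layout keeps every access NUMA-local.

#include <upc.h>

#define N 4096

/* [*] gives a blocked layout, one contiguous chunk per thread, so on
   a NUMA machine each chunk sits in the memory nearest its owner. */
shared [*] double x[N], y[N];

void axpy(double alpha) {
    int i;
    /* Affinity &x[i] matches the blocked layout: each thread touches
       only the block it owns, so no remote or far-NUMA accesses. */
    upc_forall (i = 0; i < N; i++; &x[i])
        y[i] += alpha * x[i];
}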
Evaluating automatic parallelization for efficient execution on shared-memory multiprocessors
1994
Proceedings of the 8th international conference on Supercomputing - ICS '94
With this metric, our algorithm improves or matches hand-coded parallel programs on shared-memory, bus-based parallel machines for eight of the nine programs in our test suite. ...
The algorithm optimizes for data locality and parallelism, reducing or eliminating false sharing. It also uses interprocedural analysis and transformations to improve the granularity of parallelism. ...
I thank Mary Hall, Chau-Wen Tseng, Preston Briggs, Paul Havlak, and Nat McIntosh for their comments and support throughout the development of this work. ...
doi:10.1145/181181.181265
dblp:conf/ics/McKinley94
fatcat:4eulalgo3rc5naym2lhb2wjl6i
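False sharing, which the snippet above says the algorithm reduces or eliminates, occurs when threads write distinct variables that happen to share a cache line, so the line ping-pongs between caches. The paper's compiler transforms Fortran automatically; the C/OpenMP sketch below, with invented names, only shows the manual analogue of one such fix (padding per-thread data to cache-line boundaries).

#include <omp.h>

#define NTHREADS   8
#define CACHE_LINE 64

/* An unpadded long counts[NTHREADS] would put several per-thread
   counters on one cache line; every increment would then invalidate
   that line in the other processors' caches. Padding each counter
   out to a full line eliminates the false sharing. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counts[NTHREADS];

void count_evens(const int *data, int n) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            if (data[i] % 2 == 0)
                counts[t].value++;
    }
}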
Hierarchical Models and Software Tools for Parallel Programming
[chapter]
2003
Lecture Notes in Computer Science
We thus need well-founded models and efficient new tools for hierarchical parallel machines, in order to connect algorithm design and complexity results to high-performance program implementation. ...
The similarity among the issues of managing memory hierarchies and those of parallel computation has been pointed out before (see for instance [213]). ...
Acknowledgments We wish to thank all the participants to the GI-Dagstuhl-Forschungsseminar "Algorithms for Memory Hierarchies". ...
doi:10.1007/3-540-36574-5_15
fatcat:mgkm5xi34vanfjx2fkryqfoq4i
An advanced compiler framework for non-cache-coherent multiprocessors
2002
IEEE Transactions on Parallel and Distributed Systems
From our experiments, we learned that our compiler performs well for a variety of applications on the T3D and T3E and we found a few sophisticated techniques that could improve performance even more once ...
In this paper, we present our experience with a compiler framework for automatic parallelization and communication generation that has the potential to reduce the time-consuming hand-tuning that would ...
without into one-sided shared-memory codes for execution on NCC machines and analyzing their performance. ...
doi:10.1109/71.993205
fatcat:m6ts7o7jxvg5hbb23pjayd45a4
A compiler optimization algorithm for shared-memory multiprocessors
1998
IEEE Transactions on Parallel and Distributed Systems
We compare the original parallel program to the hand-optimized program, and show that our algorithm improves 3 programs, matches 4 programs, and degrades 1 program in our test suite on a shared-memory, ...
optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines. ...
In this paper, we consider optimizing Fortran programs for symmetric shared-memory, bus-based parallel machines with local caches. ...
doi:10.1109/71.706049
fatcat:3m5odkybzvgm3putgvlki3aznu
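One representative transformation behind results like those above is loop permutation for cache locality. The sketch below is a generic C analogue, not the paper's algorithm; note that the paper targets Fortran, whose column-major layout reverses which loop order is preferred.

#define N 1024
static double a[N][N], b[N][N];

/* Poor order for C's row-major arrays: the inner loop strides by
   N elements, touching a new cache line on almost every access. */
void scale_colmajor(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 2.0 * b[i][j];
}

/* After interchange the inner loop is unit-stride, so each cache
   line is fully used before it is evicted. */
void scale_rowmajor(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 2.0 * b[i][j];
}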
Modern Fortran as the Port of Entry for Scientific Parallel Computing
1996
Computing in High Energy Physics '95
This paper describes Fortran 90 and the standardized language extensions for both shared-memory and distributed-memory parallelism. ...
Several case studies are examined showing how the distributed-memory extensions (High Performance Fortran) are used both for data parallel and MIMD (multiple instruction, multiple data) algorithms. ...
to achieve parallelism locally on each of the SMP machines. ...
doi:10.1142/9789814447188_0110
fatcat:zgetmnukpnd5ppcwtnmwm6t3ia
Towards a Complexity Model for Design and Analysis of PGAS-Based Algorithms
[chapter]
2007
Lecture Notes in Computer Science
The experimental results shed further light on the impact of data distributions on locality and performance and confirm the accuracy of the complexity model as a useful tool for the design and analysis ...
PGAS programming languages provide ease-of-use through a global shared address space while emphasizing performance by providing locality awareness and a partition of the address space. ...
The most popular is the Parallel Random Access Machine (PRAM) model, which is used for both shared memory and network-based systems [10] . ...
doi:10.1007/978-3-540-75444-2_63
fatcat:vjijb5ktrngfvl5vbbv77kuuci
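A generic cost expression in the spirit of such a model (not the paper's exact formulation) separates local from remote accesses, since a PGAS data distribution changes only how memory operations split between the two:

\[ T \;\approx\; T_{\mathrm{comp}} \;+\; N_{\mathrm{local}}\, t_{\mathrm{local}} \;+\; N_{\mathrm{remote}}\, t_{\mathrm{remote}}, \qquad t_{\mathrm{remote}} \gg t_{\mathrm{local}} \]

Because t_remote dominates, a distribution that converts remote accesses into local ones reduces T almost linearly in N_remote, which matches the observed impact of data distribution on performance quoted above.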
ScaleUPC
2009
Proceedings of the Third Conference on Partitioned Global Address Space Programing Models - PGAS '09
model makes it a natural choice for a single multi-core machine, where the main memory is physically shared. ...
As the communication cost for remote accesses is removed because all accesses are physically local in a multi-core, we find that the overhead of pointer arithmetic on shared data accesses becomes a prominent ...
Acknowledgements We would like to thank the anonymous reviewers for their useful comments on this paper. This work is supported by NSF Career CCF-0643664, NSF CCF-0811427, and NSF CCF-0833082. ...
doi:10.1145/1809961.1809976
fatcat:g5q23kqffredjeb7rexpuicix4
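The pointer-arithmetic overhead ScaleUPC identifies comes from shared pointers carrying thread and phase bookkeeping even when the target is physically local. A common manual counterpart of such compiler optimizations is privatization: casting a thread's own block to an ordinary C pointer. A minimal sketch with invented names follows; the cast is legal in UPC only because the block has affinity to the calling thread.

#include <upc.h>

#define BLOCK 1024

shared [BLOCK] double v[BLOCK * THREADS];   /* one block per thread */

double local_sum(void) {
    /* Dropping the shared qualifier lets the loop below use plain C
       pointer arithmetic instead of shared-address decoding. */
    double *mine = (double *)&v[MYTHREAD * BLOCK];
    double s = 0.0;
    for (int i = 0; i < BLOCK; i++)
        s += mine[i];
    return s;
}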
Scalable Dynamic Load Balancing Using UPC
2008
2008 37th International Conference on Parallel Processing
Our implementation achieves better scaling and parallel efficiency in both shared memory and distributed memory settings than previous efforts using UPC [1] and MPI [2]. ...
However, to obtain performance portability with UPC in both shared memory and distributed memory settings requires the careful use of one-sided reads and writes to minimize the impact of high-latency communication ...
Acknowledgment The authors thank the Renaissance Computing Institute for the use of the Kitty Hawk cluster and the University of North Carolina for the use of the Topsail cluster and the SGI Altix. ...
doi:10.1109/icpp.2008.19
dblp:conf/icpp/OlivierP08
fatcat:wgivv2ozofgjlm6fvlgkuxrqvm
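The "careful use of one-sided reads and writes" mentioned above typically means bulk transfers in place of fine-grained remote accesses. Below is a hedged sketch of that pattern with an invented layout; the paper's actual work-stealing data structures and synchronization are more involved.

#include <upc.h>

#define CHUNK 64

shared [CHUNK] int queues[CHUNK * THREADS];   /* one work chunk per thread */

/* A single one-sided bulk read fetches a victim's whole chunk in one
   transaction, instead of CHUNK individual remote reads each paying
   the full network latency. Synchronization with the victim (e.g. a
   upc_lock around the chunk) is omitted for brevity. */
void steal(int victim, int *local_buf) {
    upc_memget(local_buf, &queues[victim * CHUNK], CHUNK * sizeof(int));
}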
Combining Static and Dynamic Data Coalescing in Unified Parallel C
2016
IEEE Transactions on Parallel and Distributed Systems
Significant progress has been made in the development of programming languages and tools that are suitable for hybrid computer architectures that group several shared-memory multicores interconnected through ...
A performance evaluation reports both scaling and absolute performance numbers on up to 32768 cores of a Power 775 supercomputer. ...
PGAS programming languages use the same programming model for local, shared and distributed memory hardware. ...
doi:10.1109/tpds.2015.2405551
fatcat:isr4fuw6nvfpzfo4abngauwame
NUMA Computing with Hardware and Software Co-Support on Configurable Emulated Shared Memory Architectures
2014
International Journal of Networking and Computing
The hardware techniques include three different NUMA shared memory access mechanisms and the software ones provide a mechanism to integrate and optimize NUMA computation into the standard parallel random ...
The emulated shared memory (ESM) architectures are good candidates for future general purpose parallel computers due to their ability to provide an easy-to-use explicitly parallel synchronous model of ...
Among the architectural approaches for parallel and multicore computing that make use of memory, whether distributed on-chip or among a number of chips, there are very few that support ...
doi:10.15803/ijnc.4.1_189
fatcat:hmoejoyocncfhme3urknmvopku