Memory-aware Thread and Data Mapping for Hierarchical Multi-core Platforms
2012
International Journal of Networking and Computing
The problem is even more important in multi-core machines with NUMA characteristics, since the remote access imposes high overhead, making them more sensitive to thread and data mapping. ...
In this context, thread and data mapping are techniques that provide performance gains by improving the use of resources such as interconnections, main memory and cache memory. ...
Acknowledgment This research has been partially supported by the CAPES under grant 4874-06-4 and CNPq. ...
doi:10.15803/ijnc.2.1_97
fatcat:pcbmir2eirc4dbobn47efmplcq
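The thread-mapping idea described in this entry can be sketched with Linux's affinity interface; this is a hypothetical stdlib-Python illustration (Linux-only, not code from the paper), where `pin_to_core` is a name introduced here:

```python
import os

def pin_to_core(core_id):
    # Restrict the calling process/thread to a single core, so its
    # cache and (on a NUMA machine) local memory accesses stay on one
    # node. PID 0 means "the caller". Linux-only API.
    os.sched_setaffinity(0, {core_id})
    # Return the resulting affinity mask for inspection.
    return os.sched_getaffinity(0)
```

Real mapping policies would choose `core_id` from the machine topology (e.g., as reported by hwloc) rather than hard-coding it.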
Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes
2009
2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing
Today most systems in high-performance computing (HPC) feature a hierarchical hardware design: Shared memory nodes with several multi-core CPUs are connected via a network infrastructure. ...
Furthermore we show that machine topology has a significant impact on performance for all parallelization strategies and that topology awareness should be built into all applications in the future. ...
Fruitful discussions with Rainer Keller and Gerhard Wellein are gratefully acknowledged. ...
doi:10.1109/pdp.2009.43
dblp:conf/pdp/RabenseifnerHJ09
fatcat:jqwqavp655bvpdkavvhk3so64y
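The two-level hybrid decomposition this entry describes (MPI across nodes, OpenMP within a node) can be mimicked, purely as a conceptual sketch, with nested stdlib thread pools standing in for both levels; `hybrid_sum` and `intra_node_sum` are names invented for this illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def intra_node_sum(chunk, n_threads=2):
    # Within one "node", split the chunk across threads, loosely
    # mirroring an OpenMP parallel region.
    size = max(1, len(chunk) // n_threads)
    parts = [chunk[i:i + size] for i in range(0, len(chunk), size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(sum, parts))

def hybrid_sum(data, n_ranks=2, n_threads=2):
    # Across "nodes", hand each rank a contiguous chunk, mirroring the
    # MPI level of the hierarchy. Real hybrid code would use MPI ranks
    # here; threads are used only to keep the sketch self-contained.
    size = max(1, len(data) // n_ranks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_ranks) as pool:
        return sum(pool.map(intra_node_sum, chunks))
```

The point of the hierarchy is that the outer partition matches the machine's node boundaries, which is exactly where the paper reports topology awareness mattering.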
Efficient simulation of agent-based models on multi-GPU and multi-core clusters
2010
Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques
Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. ...
pthreads on multi-core processors. ...
Accordingly, the United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable ...
doi:10.4108/icst.simutools2010.8822
dblp:conf/simutools/AabyPS10
fatcat:ewcei2izsnh4bbymwmqjhqwd4y
GRapid: A compilation and runtime framework for rapid prototyping of graph applications on many-core processors
2014
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)
While compilation and runtime frameworks for parallelizing graph applications on multi-core CPUs exist, there is still a need for comparable frameworks for many-core devices. ...
We propose GRapid: a compilation and runtime framework that generates efficient parallel implementations of generic graph applications for multi-core CPUs, NVIDIA GPUs and Intel Xeon Phi. ...
Note that our framework also produces multi-threaded code for multi-core CPUs. ...
doi:10.1109/padsw.2014.7097806
dblp:conf/icpads/LiCB14
fatcat:getyhmefjfdlzbfsy4uawh46au
On the effectiveness of OpenMP teams for cluster-based many-core accelerators
2016
2016 International Conference on High Performance Computing & Simulation (HPCS)
Application developers are indeed required to manually deal with outlining code parts suitable for acceleration, parallelize them efficiently over many available cores, and orchestrate data transfers to ...
should also take care of properly mapping the parallel computation so as to avoid poor data locality. ...
In a scratchpad-based architecture, the master thread is typically responsible for bringing data in and out via DMA transfers, thus it is extremely important that the thread-to-core mapping follows a cluster-aware ...
doi:10.1109/hpcsim.2016.7568399
dblp:conf/ieeehpcs/CapotondiM16
fatcat:ipkzykxisncbvchukkpnl2b3lu
Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications
2016
Proceedings of the Second International Symposium on Memory Systems - MEMSYS '16
High-performance computing requires a deep knowledge of the hardware platform to fully exploit its computing power. The performance of data transfer between cores and memory is becoming critical. ...
Indeed, tasks and data have to be carefully distributed on the computing and memory resources. ...
ACKNOWLEDGMENTS We would like to thank Intel for providing us with hints for designing our new hwloc model. ...
doi:10.1145/2989081.2989115
dblp:conf/memsys/Goglin16
fatcat:eev2v2bomzcdri2gnsnfyn3fey
Architecture Aware Programming on Multi-Core Systems
2011
International Journal of Advanced Computer Science and Applications
In this paper, we propose a programming approach for the algorithms running on shared memory multi-core systems by using blocking, which is a well-known optimization technique coupled with parallel programming ...
With the advent of multi-core architectures, we are facing the problem that is new to parallel computing, namely, the management of hierarchical caches. ...
For a shared memory platform, all the cores on a single die share the same memory subsystem, and there is no direct support for binding the threads to the cores using OpenMP. ...
doi:10.14569/ijacsa.2011.020615
fatcat:oxappnveqjetpoetnq2c5zfnzq
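The blocking technique this entry refers to can be sketched as a tiled matrix multiply; this is a generic illustration (the function name `blocked_matmul` and tile size are assumptions, not taken from the paper):

```python
def blocked_matmul(A, B, n, bs=32):
    # Tiled (blocked) n x n matrix multiply: working on bs x bs tiles
    # keeps the active working set small enough to stay resident in
    # cache on a real platform, which is the point of blocking.
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

In pure Python the benefit is not visible, but the same loop structure in C or Fortran is what the paper's optimization relies on; the tile size `bs` would be tuned to the cache hierarchy.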
Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime
2011
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. ...
Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving ...
Acknowledgments This work was supported in part by a NIH Grant PHS 5 P41 RR05969-04 for Molecular Dynamics, by NSF grant OCI-0725070 for Blue Waters deployment, by the Institute for Advanced Computing ...
doi:10.1145/2063384.2063466
dblp:conf/sc/MeiSZBKPH11
fatcat:jgkzpijiljgl3hxfhl2vevglyu
A memory-centric approach to enable timing-predictability within embedded many-core accelerators
2015
2015 CSI Symposium on Real-Time and Embedded Systems and Technologies (RTEST)
There is an increasing interest among real-time systems architects for multi-and many-core accelerated platforms. ...
In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enable timing-predictability in realtime systems, can be successfully adopted on multi-and manycore heterogeneous ...
With a memory-aware task mapping and scheduling algorithm in place, it would then be possible to select which task to assign to this "unlucky" thread, reserving the higher priority threads for more ...
doi:10.1109/rtest.2015.7369851
fatcat:dz44lvdm5fffxkiwrjvfkbvtqy
Topology-Aware Mapping Techniques for Heterogeneous HPC Systems: A Systematic Survey
2018
International Journal of Advanced Computer Science and Applications
In this survey paper, we have studied various topology-aware mapping techniques and algorithms. ...
Given that, the efficient topology-aware process mapping has become vital to efficiently optimize the data locality management in order to improve the system performance and energy consumption. ...
[23] used the network/node architecture and graph embedding modules for mapping the application communication topology onto the multi-core clusters physical topology with multi-level networks. ...
doi:10.14569/ijacsa.2018.091045
fatcat:taeescyyjjej7pbqutm4kulagy
Dynamic Task and Data Placement over NUMA Architectures: An OpenMP Runtime Perspective
[chapter]
2009
Lecture Notes in Computer Science
Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into "scheduling hints" to solve thread/memory affinity issues. ...
First experiments show that mixed solutions (migrating threads and data) outperform next-touch-based data distribution policies and open possibilities for new optimizations. ...
These features enable memory-aware task and data placement but they remain expensive. ...
doi:10.1007/978-3-642-02303-3_7
fatcat:e4cc3wengncrnfrvcnbtmvbmtq
An Optimized Model for MapReduce Based on Hadoop
2016
TELKOMNIKA (Telecommunication Computing Electronics and Control)
From the perspective of fine-grained parallel data processing, combined with the Fork/Join framework, a parallel and multi-thread model, this paper optimizes the MapReduce model and puts forward a MapReduce+Fork ...
shared and distributed memory machines. ...
Acknowledgements We acknowledge the support from various grant sources: the Natural Science Foundation of Gansu Province (Grant No. 148RJZA019), the Scientific and Technological support program Foundation ...
doi:10.12928/telkomnika.v14i4.3606
fatcat:kcettjgtq5d3dkvad3d647dj3q
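The MapReduce-plus-Fork/Join combination this entry describes can be sketched as a divide-and-conquer word count; this is a hypothetical stdlib illustration of the pattern (the name `map_reduce_forkjoin` is invented here), shown sequentially for clarity:

```python
from collections import Counter

def map_reduce_forkjoin(words, threshold=8):
    # Fork/Join pattern: recursively split ("fork") until a chunk is
    # below the threshold, run the map phase on each leaf chunk, and
    # merge ("join") the partial results on the way back up.
    if len(words) <= threshold:
        return Counter(words)          # map phase on a leaf chunk
    mid = len(words) // 2
    left = map_reduce_forkjoin(words[:mid], threshold)
    right = map_reduce_forkjoin(words[mid:], threshold)
    return left + right                # reduce/join of partial counts
```

In an actual Fork/Join runtime the two recursive calls would be submitted to a work-stealing pool; the recursion structure is unchanged.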
Structuring the execution of OpenMP applications for multicore architectures
2010
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
ForestGOMP features a highlevel platform for developing and tuning portable threads schedulers. ...
The now commonplace multi-core chips have introduced, by design, a deep hierarchy of memory and cache banks within parallel computers as a tradeoff between the user friendliness of shared memory on the ...
CONCLUSION AND FUTURE WORK FORESTGOMP is a platform for executing and tuning OpenMP programs over hierarchical multicore architectures. ...
doi:10.1109/ipdps.2010.5470442
dblp:conf/ipps/BroquedisAGTWN10
fatcat:y3xatov5zvhn3op7b3ququbrom
NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture
2021
Electronics
The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism between and within nodes in a purely threaded implementation, which allows the task independence and data localization ...
We present a NUMA-aware method to reduce the number of cross-die and cross-chip memory access events. ...
Acknowledgments: The authors thank ZhiGuang Chen and Nong Xiao for their guidance and the server provided by Pengcheng Labs.
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/electronics10161984
fatcat:mkevjicswjfpzcdrtsfarp2dnq
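The two-level parallelism this entry leverages (between and within NUMA nodes) amounts to a hierarchical partition of the work; a minimal sketch, with `two_level_partition` and its tuple keys being assumptions of this illustration rather than the paper's scheme:

```python
def two_level_partition(n_rows, nodes, cores_per_node):
    # Split the output row range first across NUMA nodes, then across
    # cores within each node, so each core's block lives in memory on
    # its own node and cross-die/cross-chip accesses are minimized.
    plan = {}
    node_block = -(-n_rows // nodes)          # ceiling division
    for nd in range(nodes):
        lo = nd * node_block
        hi = min(lo + node_block, n_rows)
        core_block = -(-(hi - lo) // cores_per_node)
        for c in range(cores_per_node):
            s = lo + c * core_block
            e = min(s + core_block, hi)
            plan[(nd, c)] = (s, e)            # rows owned by (node, core)
    return plan
```

A NUMA-aware DGEMM would then bind each worker to its node (and first-touch its block there) before the compute loop runs.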
A survey on hardware-aware and heterogeneous computing on multicore processors and accelerators
2011
Concurrency and Computation
In particular, we characterize the discrepancy to conventional parallel platforms with respect to hierarchical memory sub-systems, fine-grained parallelism on several system levels, and chip-and system-level ...
Performance gains for data-and compute-intensive applications can currently only be achieved by exploiting coarse-and fine-grained parallelism on all system levels, and improved scalability with respect ...
Acknowledgements The Shared Research Group 16-1 received financial support by the Concept for the Future of Karlsruhe Institute of Technology in the framework of the German Excellence Initiative and the ...
doi:10.1002/cpe.1904
fatcat:fwg2vjaobral3b2v46vq4x2c3q
Showing results 1 — 15 out of 9,371 results