Memory-aware Thread and Data Mapping for Hierarchical Multi-core Platforms
2012
International Journal of Networking and Computing
The problem is even more important in multi-core machines with NUMA characteristics, since the remote access imposes high overhead, making them more sensitive to thread and data mapping. ...
In this context, thread and data mapping are techniques that provide performance gains by improving the use of resources such as interconnections, main memory and cache memory. ...
Acknowledgment This research has been partially supported by the CAPES under grant 4874-06-4 and CNPq. ...
doi:10.15803/ijnc.2.1_97
fatcat:pcbmir2eirc4dbobn47efmplcq
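The thread-mapping idea described in this entry can be sketched with Linux's affinity interface; this is a hypothetical stdlib-Python illustration (Linux-only, not code from the paper), where `pin_to_core` is a name introduced here:

```python
import os

def pin_to_core(core_id):
    # Restrict the calling process/thread to a single core, so its
    # cache and (on a NUMA machine) local memory accesses stay on one
    # node. PID 0 means "the caller". Linux-only API.
    os.sched_setaffinity(0, {core_id})
    # Return the resulting affinity mask for inspection.
    return os.sched_getaffinity(0)
```

Real mapping policies would choose `core_id` from the machine topology (e.g., as reported by hwloc) rather than hard-coding it.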
Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes
2009
2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing
Today most systems in high-performance computing (HPC) feature a hierarchical hardware design: Shared memory nodes with several multi-core CPUs are connected via a network infrastructure. ...
Furthermore we show that machine topology has a significant impact on performance for all parallelization strategies and that topology awareness should be built into all applications in the future. ...
Fruitful discussions with Rainer Keller and Gerhard Wellein are gratefully acknowledged. ...
doi:10.1109/pdp.2009.43
dblp:conf/pdp/RabenseifnerHJ09
fatcat:jqwqavp655bvpdkavvhk3so64y
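The two-level hybrid decomposition this entry describes (MPI across nodes, OpenMP within a node) can be mimicked, purely as a conceptual sketch, with nested stdlib thread pools standing in for both levels; `hybrid_sum` and `intra_node_sum` are names invented for this illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def intra_node_sum(chunk, n_threads=2):
    # Within one "node", split the chunk across threads, loosely
    # mirroring an OpenMP parallel region.
    size = max(1, len(chunk) // n_threads)
    parts = [chunk[i:i + size] for i in range(0, len(chunk), size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(sum, parts))

def hybrid_sum(data, n_ranks=2, n_threads=2):
    # Across "nodes", hand each rank a contiguous chunk, mirroring the
    # MPI level of the hierarchy. Real hybrid code would use MPI ranks
    # here; threads are used only to keep the sketch self-contained.
    size = max(1, len(data) // n_ranks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_ranks) as pool:
        return sum(pool.map(intra_node_sum, chunks))
```

The point of the hierarchy is that the outer partition matches the machine's node boundaries, which is exactly where the paper reports topology awareness mattering.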
Efficient simulation of agent-based models on multi-GPU and multi-core clusters
2010
Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques
Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. ...
pthreads on multi-core processors. ...
Accordingly, the United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable ...
doi:10.4108/icst.simutools2010.8822
dblp:conf/simutools/AabyPS10
fatcat:ewcei2izsnh4bbymwmqjhqwd4y
GRapid: A compilation and runtime framework for rapid prototyping of graph applications on many-core processors
2014
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)
While compilation and runtime frameworks for parallelizing graph applications on multi-core CPUs exist, there is still a need for comparable frameworks for many-core devices. ...
We propose GRapid: a compilation and runtime framework that generates efficient parallel implementations of generic graph applications for multi-core CPUs, NVIDIA GPUs and Intel Xeon Phi. ...
Note that our framework also produces multi-threaded code for multi-core CPUs. ...
doi:10.1109/padsw.2014.7097806
dblp:conf/icpads/LiCB14
fatcat:getyhmefjfdlzbfsy4uawh46au
On the effectiveness of OpenMP teams for cluster-based many-core accelerators
2016
2016 International Conference on High Performance Computing & Simulation (HPCS)
Application developers are indeed required to manually deal with outlining code parts suitable for acceleration, parallelize them efficiently over many available cores, and orchestrate data transfers to ...
should also take care of properly mapping the parallel computation so as to avoid poor data locality. ...
In a scratchpad-based architecture, the master thread is typically responsible for bringing data in and out via DMA transfers, thus it is extremely important that the thread-to-core mapping follows a cluster-aware ...
doi:10.1109/hpcsim.2016.7568399
dblp:conf/ieeehpcs/CapotondiM16
fatcat:ipkzykxisncbvchukkpnl2b3lu
Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications
2016
Proceedings of the Second International Symposium on Memory Systems - MEMSYS '16
High-performance computing requires a deep knowledge of the hardware platform to fully exploit its computing power. The performance of data transfer between cores and memory is becoming critical. ...
Indeed, tasks and data have to be carefully distributed on the computing and memory resources. ...
ACKNOWLEDGMENTS We would like to thank Intel for providing us with hints for designing our new hwloc model. ...
doi:10.1145/2989081.2989115
dblp:conf/memsys/Goglin16
fatcat:eev2v2bomzcdri2gnsnfyn3fey
Architecture Aware Programming on Multi-Core Systems
2011
International Journal of Advanced Computer Science and Applications
In this paper, we propose a programming approach for the algorithms running on shared memory multi-core systems by using blocking, which is a well-known optimization technique coupled with parallel programming ...
With the advent of multi-core architectures, we are facing the problem that is new to parallel computing, namely, the management of hierarchical caches. ...
For a shared memory platform, all the cores on a single die share the same memory subsystem, and there is no direct support for binding the threads to the cores using OpenMP. ...
doi:10.14569/ijacsa.2011.020615
fatcat:oxappnveqjetpoetnq2c5zfnzq
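The blocking technique this entry refers to can be sketched as a tiled matrix multiply; this is a generic illustration (the function name `blocked_matmul` and tile size are assumptions, not taken from the paper):

```python
def blocked_matmul(A, B, n, bs=32):
    # Tiled (blocked) n x n matrix multiply: working on bs x bs tiles
    # keeps the active working set small enough to stay resident in
    # cache on a real platform, which is the point of blocking.
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

In pure Python the benefit is not visible, but the same loop structure in C or Fortran is what the paper's optimization relies on; the tile size `bs` would be tuned to the cache hierarchy.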
Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime
2011
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. ...
Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving ...
Acknowledgments This work was supported in part by a NIH Grant PHS 5 P41 RR05969-04 for Molecular Dynamics, by NSF grant OCI-0725070 for Blue Waters deployment, by the Institute for Advanced Computing ...
doi:10.1145/2063384.2063466
dblp:conf/sc/MeiSZBKPH11
fatcat:jgkzpijiljgl3hxfhl2vevglyu
A memory-centric approach to enable timing-predictability within embedded many-core accelerators
2015
2015 CSI Symposium on Real-Time and Embedded Systems and Technologies (RTEST)
There is an increasing interest among real-time systems architects for multi-and many-core accelerated platforms. ...
In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enable timing-predictability in realtime systems, can be successfully adopted on multi-and manycore heterogeneous ...
With a memory-aware task mapping and scheduling algorithm in place, it would then be possible to select which task to assign to this "unlucky" thread, reserving the higher priority threads for more ...
doi:10.1109/rtest.2015.7369851
fatcat:dz44lvdm5fffxkiwrjvfkbvtqy
Topology-Aware Mapping Techniques for Heterogeneous HPC Systems: A Systematic Survey
2018
International Journal of Advanced Computer Science and Applications
In this survey paper, we have studied various topology-aware mapping techniques and algorithms. ...
Given that, the efficient topology-aware process mapping has become vital to efficiently optimize the data locality management in order to improve the system performance and energy consumption. ...
[23] used the network/node architecture and graph embedding modules for mapping the application communication topology onto the multi-core clusters physical topology with multi-level networks. ...
doi:10.14569/ijacsa.2018.091045
fatcat:taeescyyjjej7pbqutm4kulagy
Dynamic Task and Data Placement over NUMA Architectures: An OpenMP Runtime Perspective
[chapter]
2009
Lecture Notes in Computer Science
Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into "scheduling hints" to solve thread/memory affinity issues. ...
First experiments show that mixed solutions (migrating threads and data) outperform next-touch-based data distribution policies and open possibilities for new optimizations. ...
These features enable memory-aware task and data placement but they remain expensive. ...
doi:10.1007/978-3-642-02303-3_7
fatcat:e4cc3wengncrnfrvcnbtmvbmtq
An Optimized Model for MapReduce Based on Hadoop
2016
TELKOMNIKA (Telecommunication Computing Electronics and Control)
From the perspective of fine-grained parallel data processing, combined with the Fork/Join framework, a parallel and multi-thread model, this paper optimizes the MapReduce model and puts forward a MapReduce+Fork ...
shared and distributed memory machines. ...
Acknowledgements We acknowledge the support from various grant sources: the Natural Science Foundation of Gansu Province (Grant No. 148RJZA019), the Scientific and Technological support program Foundation ...
doi:10.12928/telkomnika.v14i4.3606
fatcat:kcettjgtq5d3dkvad3d647dj3q
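The MapReduce-plus-Fork/Join combination this entry describes can be sketched as a divide-and-conquer word count; this is a hypothetical stdlib illustration of the pattern (the name `map_reduce_forkjoin` is invented here), shown sequentially for clarity:

```python
from collections import Counter

def map_reduce_forkjoin(words, threshold=8):
    # Fork/Join pattern: recursively split ("fork") until a chunk is
    # below the threshold, run the map phase on each leaf chunk, and
    # merge ("join") the partial results on the way back up.
    if len(words) <= threshold:
        return Counter(words)          # map phase on a leaf chunk
    mid = len(words) // 2
    left = map_reduce_forkjoin(words[:mid], threshold)
    right = map_reduce_forkjoin(words[mid:], threshold)
    return left + right                # reduce/join of partial counts
```

In an actual Fork/Join runtime the two recursive calls would be submitted to a work-stealing pool; the recursion structure is unchanged.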
Structuring the execution of OpenMP applications for multicore architectures
2010
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
ForestGOMP features a highlevel platform for developing and tuning portable threads schedulers. ...
The now commonplace multi-core chips have introduced, by design, a deep hierarchy of memory and cache banks within parallel computers as a tradeoff between the user friendliness of shared memory on the ...
CONCLUSION AND FUTURE WORK FORESTGOMP is a platform for executing and tuning OpenMP programs over hierarchical multicore architectures. ...
doi:10.1109/ipdps.2010.5470442
dblp:conf/ipps/BroquedisAGTWN10
fatcat:y3xatov5zvhn3op7b3ququbrom
NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture
2021
Electronics
The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism between and within nodes in a purely threaded implementation, which allows the task independence and data localization ...
We present a NUMA-aware method to reduce the number of cross-die and cross-chip memory access events. ...
Acknowledgments: The authors thank ZhiGuang Chen and Nong Xiao for their guidance and the server provided by Pengcheng Labs.
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/electronics10161984
fatcat:mkevjicswjfpzcdrtsfarp2dnq
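The two-level parallelism this entry leverages (between and within NUMA nodes) amounts to a hierarchical partition of the work; a minimal sketch, with `two_level_partition` and its tuple keys being assumptions of this illustration rather than the paper's scheme:

```python
def two_level_partition(n_rows, nodes, cores_per_node):
    # Split the output row range first across NUMA nodes, then across
    # cores within each node, so each core's block lives in memory on
    # its own node and cross-die/cross-chip accesses are minimized.
    plan = {}
    node_block = -(-n_rows // nodes)          # ceiling division
    for nd in range(nodes):
        lo = nd * node_block
        hi = min(lo + node_block, n_rows)
        core_block = -(-(hi - lo) // cores_per_node)
        for c in range(cores_per_node):
            s = lo + c * core_block
            e = min(s + core_block, hi)
            plan[(nd, c)] = (s, e)            # rows owned by (node, core)
    return plan
```

A NUMA-aware DGEMM would then bind each worker to its node (and first-touch its block there) before the compute loop runs.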
A survey on hardware-aware and heterogeneous computing on multicore processors and accelerators
2011
Concurrency and Computation
In particular, we characterize the discrepancy to conventional parallel platforms with respect to hierarchical memory sub-systems, fine-grained parallelism on several system levels, and chip-and system-level ...
Performance gains for data-and compute-intensive applications can currently only be achieved by exploiting coarse-and fine-grained parallelism on all system levels, and improved scalability with respect ...
Acknowledgements The Shared Research Group 16-1 received financial support by the Concept for the Future of Karlsruhe Institute of Technology in the framework of the German Excellence Initiative and the ...
doi:10.1002/cpe.1904
fatcat:fwg2vjaobral3b2v46vq4x2c3q
Showing results 1 — 15 out of 9,371 results