Memory Row Reuse Distance and its Role in Optimizing Application Performance.

Focusing on multithreaded applications, the first contribution of this paper is the definition of a new metric called (memory) row reuse distance (RRD). ... it on application performance. ... To our knowledge, this is the first paper that formalizes the (memory) row reuse distance concept, and demonstrates the impact of optimizing (minimizing) it on the row-buffer locality and overall application ...

doi:10.1145/2745844.2745867 dblp:conf/sigmetrics/KandemirZTK15 fatcat:zqueixkro5dlnjrwelnrcnaafm

In this paper, we show that in addition to the recency information provided by the cache replacement policy, post eviction reuse distance (PERD) and main memory access latency cost are useful to make better-informed ... However, most prior works focus their efforts on optimizing cache miss counts experienced by applications, irrespective of the interactions between the LLC and other components in the memory hierarchy ... eviction reuse distance and memory access cost, we study the performance benefit achieved by each component in isolation for a few interesting applications. ...

doi:10.1109/iccd.2014.6974670 dblp:conf/iccd/ArunkumarW14 fatcat:ips52qvzr5fstooqkbmtnbazqu

External memory bandwidth is a crucial bottleneck in the majority of computation-intensive applications for both performance and power consumption. ... Loop transformation for data locality and memory hierarchy allocation are two major steps in data reuse optimization flow. But they were carried out independently. ... INTRODUCTION Memory systems play an increasingly important role in modern computation system design, for both general-purpose processors and application-specific accelerators. ...

doi:10.1109/iccad.2011.6105324 dblp:conf/iccad/CongZZ11 fatcat:4gmi6ouh5vgzdljdkk5nddggee

Second, it discusses a more aggressive strategy that sacrifices some cache performance in order to further improve row-buffer performance (i.e., it trades cache performance for memory system performance ... Most of the prior compiler based data locality optimization works target exclusively cache locality optimization, and row-buffer locality in DRAM banks received much less attention. ... Acknowledgment This research is supported in part by NSF grants #0963839, #1302557, #1213052, #1017882, and #1205618, a grant from INTEL and a grant from MICROSOFT. ...

doi:10.1145/2628071.2628082 dblp:conf/IEEEpact/DingKGJDY14 fatcat:a7j3vxei3bas3bbmasntq2bcua

Graph convolutional network (GCN) emerges as a promising direction to learn the inductive representation in graph data commonly used in widespread applications, such as E-commerce, social networks, and ... However, learning from graphs is non-trivial because of its mixed computation model involving both graph analytics and neural network computing. ... Thus the data reuse optimization plays a more important role for larger graphs. ...

arXiv:2009.12495v1 fatcat:c7alktpjfjdzhbfmnsbwivv74a

, caused by large IP-to-IP data reuse distances. ... In this work, we study workloads from these domains and identify the memory subsystem (system agent) to be a critical bottleneck to performance scaling. ... ACKNOWLEDGMENTS This research is supported in part by the following NSF grants: #1205618, #1213052, #1302225, #1302557, #1317560, #1320478, #1409095, #1439021, #1439057, and grants from Intel. ...

doi:10.1109/micro.2014.60 dblp:conf/micro/YedlapalliNSSKD14 fatcat:v2jdz3ts3fdszad2ycf2gntlr4

In particular, data movement and reuse play a crucial role in optimization and are often hard to improve without detailed program inspection. ... Case studies analyzing and optimizing real-world applications demonstrate our tool's effectiveness in guiding optimization decisions and making the performance tuning process more interactive. ... P.S. and T.B.N. are supported by the Swiss National Science Foundation (Ambizione Project No. 185778). ...

arXiv:2207.07433v2 fatcat:cldp5bkn3veilbksm3fzoqr7sm

It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. ... In order to carry out this strategy to its full extent, this research has developed a set of compiler transformations that perform computation fusion and data grouping over the whole program and during ... As compilers are taking an increasingly important role in optimizing the deep and complex memory hierarchy, their failure also becomes more dangerous and may lead to serious performance slowdown. ...

doi:10.1016/j.jpdc.2003.09.005 fatcat:lt762atuijgefjrr4mqpm6q3wm

By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse ... Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. ... Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S. ...

doi:10.1145/3037697.3037709 dblp:conf/asplos/LiS0L0C17 fatcat:vw7yzpigbbhp5ml7alahsiggc4

By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse ... Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. ... Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S. ...

doi:10.1145/3093337.3037709 fatcat:4lmnja4e6veonfg2ojox7a6pbq

By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse ... Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. ... Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S. ...

doi:10.1145/3093315.3037709 fatcat:h7vhnovsqndmxduewwjw5fpy6e

By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse ... Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. ... Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S. ...

doi:10.1145/3093336.3037709 fatcat:summls6gdza45pmuih42753rhy

many application of a general memory other purposes. ... interesting to keep the last row and/or column. ...

In other cases the performance gain can be increased due to a greater reuse of the data loaded in the cache. ... The theoretical performance gain expected when several cores cooperate in the parallel execution of an application can be reduced in some cases by a cache access bottleneck, as the data accessed by them ... Acknowledgments This work has been supported by the Galician Government under projects Consolidation of Competitive Research Groups (ref 2010/6), INCITE08PXIB105161PR and UDC/GI-000265, and the Ministry ...

doi:10.1016/j.parco.2012.11.003 fatcat:le4q364hkzcfrcaci7euouphou

It deploys the distance matrix calculation to matrix multiplication units and adopts on-the-fly top-𝑘 selection to avoid transferring the exa-scale distance matrix to/from device memory. flyKNNG co-designs ... the two key algorithms to optimize the overall performance: the distance matrix calculation algorithm considers the data communication costs and pruning strategy of top-𝑘 selection; the top-𝑘 selection ... We appreciate ICS reviewers for their constructive comments and suggestions. ...

doi:10.1145/3524059.3532368 fatcat:srcd3qcxxrfzpnwdk3ppe7lewy

Memory Row Reuse Distance and its Role in Optimizing Application Performance

ReMAP: Reuse and memory access cost aware eviction policy for last level cache management

Combined loop transformation and hierarchy allocation for data reuse optimization

Trading cache hit rate for memory performance

Rubik: A Hierarchical Architecture for Efficient Graph Learning [article]

Short-Circuiting Memory Traffic in Handheld Platforms

Boosting Performance Optimization with Interactive Data Movement Visualization [article]

Improving effective bandwidth through compiler enhancement of global cache reuse

Locality-Aware CTA Clustering for Modern GPUs

Locality-Aware CTA Clustering for Modern GPUs

Locality-Aware CTA Clustering for Modern GPUs

Locality-Aware CTA Clustering for Modern GPUs

Page 332 of IEEE Transactions on Computers Vol. 52, Issue 3 [page]

Accurate prediction of the behavior of multithreaded applications in shared caches

Efficient exact K-nearest neighbor graph construction for billion-scale datasets using GPUs with tensor cores
