Memory Row Reuse Distance and its Role in Optimizing Application Performance
2015
Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems - SIGMETRICS '15
Focusing on multithreaded applications, the first contribution of this paper is the definition of a new metric called (memory) row reuse distance (RRD). ...
it on application performance. ...
To our knowledge, this is the first paper that formalizes the (memory) row reuse distance concept, and demonstrates the impact of optimizing (minimizing) it on the row-buffer locality and overall application ...
doi:10.1145/2745844.2745867
dblp:conf/sigmetrics/KandemirZTK15
fatcat:zqueixkro5dlnjrwelnrcnaafm
ReMAP: Reuse and memory access cost aware eviction policy for last level cache management
2014
2014 IEEE 32nd International Conference on Computer Design (ICCD)
In this paper, we show that in addition to the recency information provided by the cache replacement policy, post eviction reuse distance (PERD) and main memory access latency cost are useful to make better-informed ...
However, most prior works focus their efforts on optimizing cache miss counts experienced by applications, irrespective of the interactions between the LLC and other components in the memory hierarchy ...
eviction reuse distance and memory access cost, we study the performance benefit achieved by each component in isolation for a few interesting applications. ...
doi:10.1109/iccd.2014.6974670
dblp:conf/iccd/ArunkumarW14
fatcat:ips52qvzr5fstooqkbmtnbazqu
Combined loop transformation and hierarchy allocation for data reuse optimization
2011
2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)
External memory bandwidth is a crucial bottleneck in the majority of computation-intensive applications for both performance and power consumption. ...
Loop transformation for data locality and memory hierarchy allocation are two major steps in data reuse optimization flow. But they were carried out independently. ...
INTRODUCTION Memory systems play an increasingly important role in modern computation system design, for both general-purpose processors and application-specific accelerators. ...
doi:10.1109/iccad.2011.6105324
dblp:conf/iccad/CongZZ11
fatcat:4gmi6ouh5vgzdljdkk5nddggee
Trading cache hit rate for memory performance
2014
Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14
Second, it discusses a more aggressive strategy that sacrifices some cache performance in order to further improve row-buffer performance (i.e., it trades cache performance for memory system performance ...
Most of the prior compiler based data locality optimization works target exclusively cache locality optimization, and row-buffer locality in DRAM banks received much less attention. ...
Acknowledgment This research is supported in part by NSF grants #0963839, #1302557, #1213052, #1017882, and #1205618, a grant from INTEL and a grant from MICROSOFT. ...
doi:10.1145/2628071.2628082
dblp:conf/IEEEpact/DingKGJDY14
fatcat:a7j3vxei3bas3bbmasntq2bcua
Rubik: A Hierarchical Architecture for Efficient Graph Learning
[article]
2020
arXiv
pre-print
Graph convolutional network (GCN) emerges as a promising direction to learn the inductive representation in graph data commonly used in widespread applications, such as E-commerce, social networks, and ...
However, learning from graphs is non-trivial because of its mixed computation model involving both graph analytics and neural network computing. ...
Thus the data reuse optimization plays a more important role for larger graphs. ...
arXiv:2009.12495v1
fatcat:c7alktpjfjdzhbfmnsbwivv74a
Short-Circuiting Memory Traffic in Handheld Platforms
2014
2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
... caused by large IP-to-IP data reuse distances. ...
In this work, we study workloads from these domains and identify the memory subsystem (system agent) to be a critical bottleneck to performance scaling. ...
ACKNOWLEDGMENTS This research is supported in part by the following NSF grants: #1205618, #1213052, #1302225, #1302557, #1317560, #1320478, #1409095, #1439021, #1439057, and grants from Intel. ...
doi:10.1109/micro.2014.60
dblp:conf/micro/YedlapalliNSSKD14
fatcat:v2jdz3ts3fdszad2ycf2gntlr4
Boosting Performance Optimization with Interactive Data Movement Visualization
[article]
2022
arXiv
pre-print
In particular, data movement and reuse play a crucial role in optimization and are often hard to improve without detailed program inspection. ...
Case studies analyzing and optimizing real-world applications demonstrate our tool's effectiveness in guiding optimization decisions and making the performance tuning process more interactive. ...
P.S. and T.B.N. are supported by the Swiss National Science Foundation (Ambizione Project No. 185778). ...
arXiv:2207.07433v2
fatcat:cldp5bkn3veilbksm3fzoqr7sm
Improving effective bandwidth through compiler enhancement of global cache reuse
2004
Journal of Parallel and Distributed Computing
It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. ...
In order to carry out this strategy to its full extent, this research has developed a set of compiler transformations that perform computation fusion and data grouping over the whole program and during ...
As compilers are taking an increasingly important role in optimizing the deep and complex memory hierarchy, their failure also becomes more dangerous and may lead to serious performance slowdown. ...
doi:10.1016/j.jpdc.2003.09.005
fatcat:lt762atuijgefjrr4mqpm6q3wm
Locality-Aware CTA Clustering for Modern GPUs
2017
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '17
By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse ...
Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. ...
Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S. ...
doi:10.1145/3037697.3037709
dblp:conf/asplos/LiS0L0C17
fatcat:vw7yzpigbbhp5ml7alahsiggc4
Locality-Aware CTA Clustering for Modern GPUs
2017
SIGARCH Computer Architecture News
By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse ...
Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. ...
Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S. ...
doi:10.1145/3093337.3037709
fatcat:4lmnja4e6veonfg2ojox7a6pbq
Locality-Aware CTA Clustering for Modern GPUs
2017
ACM SIGOPS Operating Systems Review
By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse ...
Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. ...
Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S. ...
doi:10.1145/3093315.3037709
fatcat:h7vhnovsqndmxduewwjw5fpy6e
Locality-Aware CTA Clustering for Modern GPUs
2017
SIGPLAN Notices
By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse ...
Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. ...
Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S. ...
doi:10.1145/3093336.3037709
fatcat:summls6gdza45pmuih42753rhy
Page 332 of IEEE Transactions on Computers Vol. 52, Issue 3
[page]
2003
IEEE Transactions on Computers
many application of a general memory other purposes. ...
interesting to keep the last row and/or column. ...
Accurate prediction of the behavior of multithreaded applications in shared caches
2013
Parallel Computing
In other cases the performance gain can be increased due to a greater reuse of the data loaded in the cache. ...
The theoretical performance gain expected when several cores cooperate in the parallel execution of an application can be reduced in some cases by a cache access bottleneck, as the data accessed by them ...
Acknowledgments This work has been supported by the Galician Government under projects Consolidation of Competitive Research Groups (ref 2010/6), INCITE08PXIB105161PR and UDC/GI-000265, and the Ministry ...
doi:10.1016/j.parco.2012.11.003
fatcat:le4q364hkzcfrcaci7euouphou
Efficient exact K-nearest neighbor graph construction for billion-scale datasets using GPUs with tensor cores
2022
Proceedings of the 36th ACM International Conference on Supercomputing
It deploys the distance matrix calculation to matrix multiplication units and adopts on-the-fly top-𝑘 selection to avoid transferring the exa-scale distance matrix to/from device memory. flyKNNG co-designs ...
the two key algorithms to optimize the overall performance: the distance matrix calculation algorithm considers the data communication costs and pruning strategy of top-𝑘 selection; the top-𝑘 selection ...
We appreciate ICS reviewers for their constructive comments and suggestions. ...
doi:10.1145/3524059.3532368
fatcat:srcd3qcxxrfzpnwdk3ppe7lewy