Memory Row Reuse Distance and its Role in Optimizing Application Performance

Mahmut Kandemir, Hui Zhao, Xulong Tang, Mustafa Karakoy
2015 Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems - SIGMETRICS '15  
Focusing on multithreaded applications, the first contribution of this paper is the definition of a new metric called (memory) row reuse distance (RRD).  ...  it on application performance.  ...  To our knowledge, this is the first paper that formalizes the (memory) row reuse distance concept, and demonstrates the impact of optimizing (minimizing) it on the row-buffer locality and overall application  ... 
doi:10.1145/2745844.2745867 dblp:conf/sigmetrics/KandemirZTK15 fatcat:zqueixkro5dlnjrwelnrcnaafm

ReMAP: Reuse and memory access cost aware eviction policy for last level cache management

Akhil Arunkumar, Carole-Jean Wu
2014 2014 IEEE 32nd International Conference on Computer Design (ICCD)  
In this paper, we show that in addition to the recency information provided by the cache replacement policy, post eviction reuse distance (PERD) and main memory access latency cost are useful to make better-informed  ...  However, most prior works focus their efforts on optimizing cache miss counts experienced by applications, irrespective of the interactions between the LLC and other components in the memory hierarchy  ...  eviction reuse distance and memory access cost, we study the performance benefit achieved by each component in isolation for a few interesting applications.  ... 
doi:10.1109/iccd.2014.6974670 dblp:conf/iccd/ArunkumarW14 fatcat:ips52qvzr5fstooqkbmtnbazqu

Combined loop transformation and hierarchy allocation for data reuse optimization

Jason Cong, Peng Zhang, Yi Zou
2011 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)  
External memory bandwidth is a crucial bottleneck in the majority of computation-intensive applications for both performance and power consumption.  ...  Loop transformation for data locality and memory hierarchy allocation are two major steps in data reuse optimization flow. But they were carried out independently.  ...  INTRODUCTION Memory systems play an increasingly important role in modern computation system design, for both general-purpose processors and application-specific accelerators.  ... 
doi:10.1109/iccad.2011.6105324 dblp:conf/iccad/CongZZ11 fatcat:4gmi6ouh5vgzdljdkk5nddggee

Trading cache hit rate for memory performance

Wei Ding, Mahmut Kandemir, Diana Guttman, Adwait Jog, Chita R. Das, Praveen Yedlapalli
2014 Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14  
Second, it discusses a more aggressive strategy that sacrifices some cache performance in order to further improve row-buffer performance (i.e., it trades cache performance for memory system performance  ...  Most of the prior compiler based data locality optimization works target exclusively cache locality optimization, and row-buffer locality in DRAM banks received much less attention.  ...  Acknowledgment This research is supported in part by NSF grants #0963839, #1302557, #1213052, #1017882, and #1205618, a grant from INTEL and a grant from MICROSOFT.  ... 
doi:10.1145/2628071.2628082 dblp:conf/IEEEpact/DingKGJDY14 fatcat:a7j3vxei3bas3bbmasntq2bcua

Rubik: A Hierarchical Architecture for Efficient Graph Learning [article]

Xiaobing Chen, Yuke Wang, Xinfeng Xie, Xing Hu, Abanti Basak, Ling Liang, Mingyu Yan, Lei Deng, Yufei Ding, Zidong Du, Yunji Chen, Yuan Xie
2020 arXiv   pre-print
Graph convolutional network (GCN) emerges as a promising direction to learn the inductive representation in graph data commonly used in widespread applications, such as E-commerce, social networks, and  ...  However, learning from graphs is non-trivial because of its mixed computation model involving both graph analytics and neural network computing.  ...  Thus the data reuse optimization plays a more important role for larger graphs.  ... 
arXiv:2009.12495v1 fatcat:c7alktpjfjdzhbfmnsbwivv74a

Short-Circuiting Memory Traffic in Handheld Platforms

Praveen Yedlapalli, Nachiappan Chidambaram Nachiappan, Niranjan Soundararajan, Anand Sivasubramaniam, Mahmut T. Kandemir, Chita R. Das
2014 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture  
, caused by large IP-to-IP data reuse distances.  ...  In this work, we study workloads from these domains and identify the memory subsystem (system agent) to be a critical bottleneck to performance scaling.  ...  ACKNOWLEDGMENTS This research is supported in part by the following NSF grants -#1205618, #1213052, #1302225, #1302557, #1317560, #1320478, #1409095, #1439021, #1439057, and grants from Intel.  ... 
doi:10.1109/micro.2014.60 dblp:conf/micro/YedlapalliNSSKD14 fatcat:v2jdz3ts3fdszad2ycf2gntlr4

Boosting Performance Optimization with Interactive Data Movement Visualization [article]

Philipp Schaad and Tal Ben-Nun and Torsten Hoefler
2022 arXiv   pre-print
In particular, data movement and reuse play a crucial role in optimization and are often hard to improve without detailed program inspection.  ...  Case studies analyzing and optimizing real-world applications demonstrate our tool's effectiveness in guiding optimization decisions and making the performance tuning process more interactive.  ...  P.S. and T.B.N. are supported by the Swiss National Science Foundation (Ambizione Project No. 185778).  ... 
arXiv:2207.07433v2 fatcat:cldp5bkn3veilbksm3fzoqr7sm

Improving effective bandwidth through compiler enhancement of global cache reuse

Chen Ding, Ken Kennedy
2004 Journal of Parallel and Distributed Computing  
It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption.  ...  In order to carry out this strategy to its full extent, this research has developed a set of compiler transformations that perform computation fusion and data grouping over the whole program and during  ...  As compilers are taking an increasingly important role in optimizing the deep and complex memory hierarchy, their failure also becomes more dangerous and may lead to serious performance slowdown.  ... 
doi:10.1016/j.jpdc.2003.09.005 fatcat:lt762atuijgefjrr4mqpm6q3wm

Locality-Aware CTA Clustering for Modern GPUs

Ang Li, Shuaiwen Leon Song, Weifeng Liu, Xu Liu, Akash Kumar, Henk Corporaal
2017 Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '17  
By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse  ...  Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable.  ...  Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S.  ... 
doi:10.1145/3037697.3037709 dblp:conf/asplos/LiS0L0C17 fatcat:vw7yzpigbbhp5ml7alahsiggc4

Locality-Aware CTA Clustering for Modern GPUs

Ang Li, Shuaiwen Leon Song, Weifeng Liu, Xu Liu, Akash Kumar, Henk Corporaal
2017 SIGARCH Computer Architecture News  
By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse  ...  Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable.  ...  Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S.  ... 
doi:10.1145/3093337.3037709 fatcat:4lmnja4e6veonfg2ojox7a6pbq

Locality-Aware CTA Clustering for Modern GPUs

Ang Li, Shuaiwen Leon Song, Weifeng Liu, Xu Liu, Akash Kumar, Henk Corporaal
2017 ACM SIGOPS Operating Systems Review  
By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse  ...  Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable.  ...  Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S.  ... 
doi:10.1145/3093315.3037709 fatcat:h7vhnovsqndmxduewwjw5fpy6e

Locality-Aware CTA Clustering for Modern GPUs

Ang Li, Shuaiwen Leon Song, Weifeng Liu, Xu Liu, Akash Kumar, Henk Corporaal
2017 SIGPLAN notices  
By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse  ...  Through further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable.  ...  Acknowledgments We would like to thank the anonymous reviewers for their constructive comments and suggestions for improving this work. This research is supported by the U.S.  ... 
doi:10.1145/3093336.3037709 fatcat:summls6gdza45pmuih42753rhy

Page 332 of IEEE Transactions on Computers Vol. 52, Issue 3 [page]

2003 IEEE Transactions on Computers  
many application of a general memory other purposes.  ...  interesting to keep the last row and/or column.  ... 

Accurate prediction of the behavior of multithreaded applications in shared caches

Diego Andrade, Basilio B. Fraguela, Ramón Doallo
2013 Parallel Computing  
In other cases the performance gain can be increased due to a greater reuse of the data loaded in the cache.  ...  The theoretical performance gain expected when several cores cooperate in the parallel execution of an application can be reduced in some cases by a cache access bottleneck, as the data accessed by them  ...  Acknowledgments This work has been supported by the Galician Government under projects Consolidation of Competitive Research Groups (ref 2010/6), INCITE08PXIB105161PR and UDC/GI-000265, and the Ministry  ... 
doi:10.1016/j.parco.2012.11.003 fatcat:le4q364hkzcfrcaci7euouphou

Efficient exact K-nearest neighbor graph construction for billion-scale datasets using GPUs with tensor cores

Zhuoran Ji, Cho-Li Wang
2022 Proceedings of the 36th ACM International Conference on Supercomputing  
It deploys the distance matrix calculation to matrix multiplication units and adopts on-the-fly top-𝑘 selection to avoid transferring the exa-scale distance matrix to/from device memory. flyKNNG co-designs  ...  the two key algorithms to optimize the overall performance: the distance matrix calculation algorithm considers the data communication costs and pruning strategy of top-𝑘 selection; the top-𝑘 selection  ...  We appreciate ICS reviewers for their constructive comments and suggestions.  ... 
doi:10.1145/3524059.3532368 fatcat:srcd3qcxxrfzpnwdk3ppe7lewy