Gretch: A Hardware Prefetcher for Graph Analytics

Authors:
Anirudh Mohan Kaushik, University of Waterloo, Ontario, Canada
Gennady Pekhimenko, University of Toronto, Ontario, Canada
Hiren Patel, University of Waterloo, Waterloo, Canada

ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 2, Article No.: 18, pp 1–25, https://doi.org/10.1145/3439803

Published: 09 February 2021

Abstract

Data-dependent memory accesses (DDAs) pose an important challenge for high-performance graph analytics (GA) because they exhibit too little temporal and spatial locality, which results in low cache performance. Prior efforts to improve the performance of DDAs for GA are not applicable across GA frameworks, because (1) they focus on only one particular graph representation, and (2) they require workload changes to communicate specific information to the hardware for their effective operation.
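To make the access pattern concrete, the following is a minimal sketch of a data-dependent access in a graph traversal. It assumes a compressed sparse row (CSR) style layout, which is only one possible graph representation and is not stated in the abstract; the array names `offsets`, `neighbors`, and `values` and both helper functions are hypothetical and chosen purely for illustration.

```python
# Illustrative only: a tiny graph in a CSR-style layout (an assumption,
# not a representation named in the abstract).
offsets = [0, 2, 3, 5, 6]        # offsets[v]..offsets[v+1] delimits v's neighbors
neighbors = [2, 3, 0, 1, 3, 0]   # neighbor IDs, stored contiguously
values = [10, 20, 30, 40]        # one property value per vertex


def neighbor_sum(v):
    """Sum the property values of the neighbors of vertex v."""
    total = 0
    for i in range(offsets[v], offsets[v + 1]):
        u = neighbors[i]      # sequential access: consecutive indices i
        total += values[u]    # data-dependent access: the index u was itself loaded
    return total


def access_trace(v):
    """Return the indices of `values` touched while visiting v's neighbors."""
    return [neighbors[i] for i in range(offsets[v], offsets[v + 1])]
```

The reads of `neighbors` walk consecutive indices, while the indices into `values` depend on the neighbor IDs that were just loaded, so they follow no fixed stride. In this sketch, `access_trace(2)` touches indices 1 and 3 of `values`.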

In this work, we propose a hardware-only solution for improving the performance of DDAs across multiple GA frameworks. We present Gretch, a hardware prefetcher for GA that addresses the above limitations. An important observation we make is that identifying certain DDAs without hardware-software communication is sensitive to instruction scheduling. A key contribution of this work is a hardware mechanism that activates Gretch to identify DDAs under either in-order or out-of-order instruction scheduling. Our evaluation shows that, across different GA workloads and frameworks, Gretch provides an average speedup of 38% over no prefetching and 25% over a conventional stride prefetcher, and outperforms prior DDA prefetchers by 22%, with only a 1% increase in power consumption.

Index Terms

Gretch: A Hardware Prefetcher for Graph Analytics
1. Computer systems organization
  1. Architectures
2. General and reference
  1. Cross-computing tools and techniques
    1. Performance

Published in
ACM Transactions on Architecture and Code Optimization Volume 18, Issue 2
June 2021
190 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3450354
Editor:
David Kaeli
Northeastern University, USA
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 9 February 2021
- Revised: 1 November 2020
- Accepted: 1 November 2020
- Received: 1 January 2020
Author Tags
Hardware prefetching
data-dependent memory accesses
graph analytics
Qualifiers
- research-article
- Research
- Refereed