Abstract
Data-dependent memory accesses (DDAs) pose an important challenge for high-performance graph analytics (GA). Such accesses exhibit little temporal or spatial locality, which results in poor cache performance. Prior efforts to improve the performance of DDAs for GA are not applicable across GA frameworks, because (1) they focus on one particular graph representation, and (2) they require workload changes to communicate specific information to the hardware in order to operate effectively.
In this work, we propose a hardware-only solution for improving the performance of DDAs for GA across multiple GA frameworks. We present Gretch, a hardware prefetcher for GA that addresses the above limitations. An important observation we make is that identifying certain DDAs without hardware-software communication is sensitive to instruction scheduling. A key contribution of this work is a hardware mechanism that activates Gretch to identify DDAs under either in-order or out-of-order instruction scheduling. Our evaluation on different GA workloads and frameworks shows that Gretch provides an average speedup of 38% over no prefetching and 25% over a conventional stride prefetcher, and outperforms prior DDA prefetchers by 22%, with only a 1% increase in power consumption.
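To make the notion of a data-dependent access concrete, the following is a minimal sketch of the access pattern only, not of Gretch's mechanism. It assumes a compressed sparse row (CSR) graph layout, which the abstract does not specify; the function name, array names, and the small example graph are all hypothetical. Python lists do not model a cache, so the comments mark where the two kinds of access would occur in a native implementation.

```python
from typing import List


def sum_neighbor_values(offsets: List[int], neighbors: List[int],
                        values: List[float], v: int) -> float:
    """Sum values[u] over every neighbor u of vertex v in a CSR graph."""
    total = 0.0
    # The edge index e advances by one each iteration, so reads of
    # neighbors[e] are sequential and easy for a stride prefetcher.
    for e in range(offsets[v], offsets[v + 1]):
        u = neighbors[e]
        # Data-dependent access: the index u was itself loaded from memory,
        # so consecutive reads of values[u] can land anywhere in the array.
        total += values[u]
    return total


# Hypothetical 4-vertex graph: 0 -> {1, 3}, 1 -> {2}, 2 -> {0, 3}, 3 -> {}
offsets = [0, 2, 3, 5, 5]
neighbors = [1, 3, 2, 0, 3]
values = [10.0, 20.0, 30.0, 40.0]

result_v0 = sum_neighbor_values(offsets, neighbors, values, 0)
result_v2 = sum_neighbor_values(offsets, neighbors, values, 2)
result_v3 = sum_neighbor_values(offsets, neighbors, values, 3)
```

In this sketch the read of `values[u]` is the DDA: its address is known only after `neighbors[e]` has been loaded, which is why a prefetcher that tracks address strides alone gains little on it.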
Index Terms
- Gretch: A Hardware Prefetcher for Graph Analytics