research-article
Open Access

Gretch: A Hardware Prefetcher for Graph Analytics

Published: 09 February 2021

Abstract

Data-dependent memory accesses (DDAs) pose an important challenge for high-performance graph analytics (GA) because they exhibit too little temporal and spatial locality, which results in poor cache performance. Prior efforts to improve the performance of DDAs for GA are not applicable across GA frameworks, because (1) they focus on only one particular graph representation, and (2) they require workload changes to communicate specific information to the hardware in order to operate effectively.
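To make the notion of a data-dependent access concrete, the following is a minimal sketch, assuming a compressed sparse row (CSR) graph layout. The array names (`offsets`, `neighbors`, `values`) and both functions are hypothetical and chosen only for illustration; the sketch shows the access pattern and is not Gretch's mechanism or any particular framework's data structure.

```python
# Hypothetical CSR graph with 4 vertices; all names are illustrative only.
offsets = [0, 2, 3, 5, 6]        # edges of vertex v are neighbors[offsets[v]:offsets[v + 1]]
neighbors = [2, 3, 0, 1, 3, 0]   # destination vertex of each edge
values = [10, 20, 30, 40]        # one property value per vertex


def sum_neighbor_values(v):
    """Sum the property values of the neighbors of vertex v."""
    total = 0
    for e in range(offsets[v], offsets[v + 1]):
        u = neighbors[e]      # sequential read of the edge array
        total += values[u]    # data-dependent access: the index is loaded data
    return total


def property_access_trace():
    """Return the property-array indices touched when visiting every vertex in order."""
    trace = []
    for v in range(len(offsets) - 1):
        for e in range(offsets[v], offsets[v + 1]):
            trace.append(neighbors[e])
    return trace


print(sum_neighbor_values(0))
print(property_access_trace())
```

In this sketch the edge array is read at consecutive indices, which a stride prefetcher can follow, whereas the indices into `values` are determined by the loaded edge data and follow no fixed stride.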

In this work, we propose a hardware-only solution for improving the performance of DDAs across multiple GA frameworks. We present Gretch, a hardware prefetcher for GA that addresses the above limitations. An important observation we make is that identifying certain DDAs without hardware-software communication is sensitive to instruction scheduling. A key contribution of this work is a hardware mechanism that activates Gretch to identify DDAs under either in-order or out-of-order instruction scheduling. Our evaluation shows that, across different GA workloads and frameworks, Gretch provides an average speedup of 38% over no prefetching and 25% over a conventional stride prefetcher, and outperforms prior DDA prefetchers by 22%, with only a 1% increase in power consumption.


      • Published in

        ACM Transactions on Architecture and Code Optimization  Volume 18, Issue 2
        June 2021
        190 pages
        ISSN:1544-3566
        EISSN:1544-3973
        DOI:10.1145/3450354
        Copyright © 2021 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 9 February 2021
        • Revised: 1 November 2020
        • Accepted: 1 November 2020
        • Received: 1 January 2020
        Qualifiers

        • research-article
        • Research
        • Refereed
