research-article
Open Access

Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache

Published: 17 July 2021

Abstract

Many emerging non-volatile memories are compatible with CMOS logic, potentially enabling their integration into a CPU’s die. This article investigates such monolithically integrated CPU–main memory chips. We exploit non-volatile memories employing 3D crosspoint subarrays, such as resistive RAM (ReRAM), and integrate them over the CPU’s last-level cache (LLC). The regular structure of cache arrays enables co-design of the LLC and ReRAM main memory for area efficiency. We also develop a streamlined LLC/main memory interface that employs a single shared internal interconnect for both the cache and main memory arrays, and uses a unified controller to service both LLC and main memory requests.

We apply our monolithic design ideas to a many-core CPU by integrating 3D ReRAM over each core’s LLC slice. We find that co-design of the LLC and ReRAM saves 27% of the total LLC–main memory area at the expense of slight increases in delay and energy. The streamlined LLC/main memory interface saves an additional 12% in area. Our simulation results show that monolithic integration of CPU and main memory improves performance by 5.3× and 1.7× over HBM2 DRAM for several graph and streaming kernels, respectively. It also reduces the memory system’s energy by 6.0× and 1.7×, respectively. Moreover, we show that the area savings of co-design permit the CPU to have 23% more cores and main memory, and that streamlining the LLC/main memory interface incurs a small 4% performance penalty.

      • Published in

        ACM Transactions on Architecture and Code Optimization  Volume 18, Issue 4
        December 2021
        497 pages
        ISSN:1544-3566
        EISSN:1544-3973
        DOI:10.1145/3476575
        Copyright © 2021 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 17 July 2021
        • Revised: 1 April 2021
        • Accepted: 1 April 2021
        • Received: 1 October 2020

        Qualifiers

        • research-article
        • Research
        • Refereed
