Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache

Authors:
Candace Walden, University of Maryland, College Park
Devesh Singh, University of Maryland, College Park
Meenatchi Jagasivamani, University of Maryland, College Park
Shang Li, University of Maryland, College Park
Luyi Kang, University of Maryland, College Park
Mehdi Asnaashari, Crossbar Inc.
Sylvain Dubois, Crossbar Inc.
Bruce Jacob, University of Maryland, College Park
Donald Yeung, University of Maryland, College Park

ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 4, Article No. 48, pp. 1–26. https://doi.org/10.1145/3462632

Published: 17 July 2021

Abstract

Many emerging non-volatile memories are compatible with CMOS logic, potentially enabling their integration into a CPU’s die. This article investigates such monolithically integrated CPU–main memory chips. We exploit non-volatile memories employing 3D crosspoint subarrays, such as resistive RAM (ReRAM), and integrate them over the CPU’s last-level cache (LLC). The regular structure of cache arrays enables co-design of the LLC and ReRAM main memory for area efficiency. We also develop a streamlined LLC/main memory interface that employs a single shared internal interconnect for both the cache and main memory arrays, and uses a unified controller to service both LLC and main memory requests.

We apply our monolithic design ideas to a many-core CPU by integrating 3D ReRAM over each core’s LLC slice. We find that co-design of the LLC and ReRAM saves 27% of the total LLC–main memory area at the expense of slight increases in delay and energy. The streamlined LLC/main memory interface saves an additional 12% in area. Our simulation results show that monolithic integration of CPU and main memory improves performance by 5.3× and 1.7× over HBM2 DRAM for several graph and streaming kernels, respectively. It also reduces the memory system’s energy by 6.0× and 1.7×, respectively. Moreover, we show that the area savings of co-design permit the CPU to have 23% more cores and main memory, and that streamlining the LLC/main memory interface incurs a small 4% performance penalty.

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Hardware
  1. Emerging technologies
    1. Memory and dense storage

Published in
ACM Transactions on Architecture and Code Optimization Volume 18, Issue 4
December 2021
497 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3476575
Editor: David Kaeli, Northeastern University, USA
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 17 July 2021
- Revised: 1 April 2021
- Accepted: 1 April 2021
- Received: 1 October 2020
Author Tags
Crosspoint architectures
ReRAM
on-die main memory systems
Qualifiers
- research-article
- Research
- Refereed