DOI: 10.1145/3472456.3472485
research-article

Coupling Right-Provisioned Cold Storage Data Centers with Deduplication

Published: 05 October 2021

ABSTRACT

Modern cloud-scale cold storage data centers have begun to support right-provisioning of a rack’s resources (power, cooling, etc.), which allows only a small fraction of the hard disks to be active (spinning) at any given time, reducing the cost of ownership. Data deduplication is a traditional approach that splits files into chunks and eliminates duplicate chunks, which can also cut costs for cold storage systems. However, when combined with right-provisioning, classical deduplication may scatter a file's chunks across disks, some of which are not currently active, leading to unacceptable access performance because disks must be spun up and down.
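The problem described above can be illustrated with a minimal sketch. It is not DeCold or any system from the paper: the fixed-size chunking, the SHA-256 fingerprints, the `DedupStore` class, and the rule that a new chunk lands on whichever disk is being written are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small so the example stays readable


def split_chunks(data, size=CHUNK_SIZE):
    """Split a byte string into fixed-size chunks (assumed chunking scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Hypothetical chunk store: each unique chunk is kept on exactly one disk."""

    def __init__(self):
        self.location = {}  # chunk fingerprint -> disk id holding the only copy
        self.recipes = {}   # file name -> ordered list of chunk fingerprints

    def write(self, name, data, disk):
        recipe = []
        for chunk in split_chunks(data):
            fp = hashlib.sha256(chunk).hexdigest()
            # A duplicate chunk is not stored again; the file simply
            # references the copy that already lives on some other disk.
            self.location.setdefault(fp, disk)
            recipe.append(fp)
        self.recipes[name] = recipe

    def disks_needed(self, name):
        """Set of disks that must be spinning to read the whole file."""
        return {self.location[fp] for fp in self.recipes[name]}


store = DedupStore()
store.write("a", b"AAAABBBB", disk=0)
store.write("b", b"AAAACCCC", disk=1)  # shares the chunk b"AAAA" with file "a"
print(store.disks_needed("a"))
print(store.disks_needed("b"))
```

In this toy setup, file "b" was written to disk 1 but reading it also requires disk 0, because its first chunk was deduplicated against file "a". If disk 0 is not among the currently active disks, the read has to wait for a spin-up.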

In this paper, we analyze the deduplication ratio under real-world cloud cold storage workloads and observe that, for most workloads, the deduplication ratio 1) increases quickly over the first few versions of the workload, and 2) increases slowly but steadily over the subsequent versions, forming a long tail. Based on the first observation, we propose an online deduplication method that improves the deduplication ratio while providing acceptable read performance; based on the second, we propose an additional offline deduplication method that achieves deduplication ratios comparable to those of classical deduplication. We design a cold storage system called DeCold that combines the two deduplication methods and improves deduplication efficiency. We prototype DeCold and conduct testbed experiments on real-world datasets including source code, virtual machines, and databases. Evaluations show that DeCold achieves better file access performance than the classical deduplication implementation while maintaining decent deduplication efficiency.

  • Published in

    cover image ACM Other conferences
    ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
    August 2021
    927 pages
    ISBN:9781450390682
    DOI:10.1145/3472456

    Copyright © 2021 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 91 of 313 submissions, 29%