ABSTRACT
Modern cloud-scale cold storage data centers have begun to support right-provisioning of a rack’s resources (power, cooling, etc.), which allows only a small fraction of the hard disks to be active (spinning) at any given time, reducing the cost of ownership. Data deduplication is a traditional approach that splits files into chunks and eliminates duplicate chunks, which can also cut costs for cold storage systems. However, when combined with right-provisioning, classical deduplication may scatter the chunks of a deduplicated file across disks, some of which are not currently active, leading to unacceptable access performance because those disks must be spun up and down.
In this paper, we analyze the deduplication ratio under real-world cloud cold storage workloads and observe that, for most workloads, the deduplication ratio 1) increases quickly over the first few versions of the workload, and 2) increases slowly but steadily over the subsequent versions, forming a long tail. Based on the first observation, we propose an online deduplication scheme that improves the deduplication ratio while providing acceptable read performance; based on the second, we propose an additional offline deduplication scheme that achieves deduplication ratios comparable to those of classical deduplication. We design a cold storage system called DeCold that combines these two schemes and improves deduplication efficiency. We prototype DeCold and conduct testbed experiments on real-world datasets including source code, virtual machines, and databases. Evaluations show that DeCold achieves better file access performance than a classical deduplication implementation while maintaining decent deduplication efficiency.