Coupling Right-Provisioned Cold Storage Data Centers with Deduplication

Authors:
Liangfeng Cheng, Huazhong University of Science and Technology, China
Yuchong Hu, Huazhong University of Science and Technology, China
Zhaokang Ke, Huazhong University of Science and Technology, China
Zhongjie Wu, Alibaba, China

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing, August 2021, Article No.: 17, Pages 1–11, https://doi.org/10.1145/3472456.3472485

Published: 05 October 2021

ABSTRACT

Modern cloud-scale cold storage data centers have begun to support right-provisioning of a rack’s resources (power, cooling, etc.), which allows only a small fraction of the hard disks to be active (spinning) at any given time and thereby reduces the cost of ownership. Data deduplication is a traditional approach that splits files into chunks and eliminates duplicate chunks, which can also cut costs for cold storage systems. However, when combined with right-provisioning, classical deduplication may spread a file's chunks across disks, some of which are not currently active, leading to unacceptable access performance because those disks must be spun up and down.
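The access problem described above can be illustrated with a minimal sketch. It is a toy model and not the paper's system: fixed-size chunking, SHA-256 fingerprints, the DedupStore class, and the disk numbering are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose so the toy data splits into several chunks


def chunks_of(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks (an assumption; real systems often
    use content-defined chunking instead)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Hypothetical chunk store: each unique chunk is kept once,
    on the disk that first wrote it."""

    def __init__(self):
        self.chunk_disk = {}  # fingerprint -> id of the disk holding the chunk
        self.recipes = {}     # file name -> ordered list of fingerprints

    def write(self, name: str, data: bytes, disk: int) -> None:
        recipe = []
        for chunk in chunks_of(data):
            fp = hashlib.sha256(chunk).hexdigest()
            # A duplicate chunk is not stored again; the recipe simply
            # points at the copy that already exists on some other disk.
            self.chunk_disk.setdefault(fp, disk)
            recipe.append(fp)
        self.recipes[name] = recipe

    def disks_needed(self, name: str) -> set:
        """Disks that must be spinning to read the file back."""
        return {self.chunk_disk[fp] for fp in self.recipes[name]}


store = DedupStore()
store.write("v1", b"AAAABBBBCCCC", disk=0)
store.write("v2", b"AAAABBBBDDDD", disk=1)  # shares two chunks with v1
print(store.disks_needed("v1"))
print(store.disks_needed("v2"))
```

In this sketch, reading v2 requires both disk 0 and disk 1 to be active, because two of its chunks were deduplicated against v1. Under right-provisioning, where only a few disks may spin at once, that dependency is what turns a read into a spin-up.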

In this paper, we analyze the deduplication ratio under real-world cloud cold storage workloads and observe that, for most workloads, the deduplication ratio 1) generally increases quickly over the first few versions of the workload, and 2) increases slowly but steadily over the subsequent versions, forming a long tail. Based on the first observation, we propose an online deduplication scheme that improves the deduplication ratio while providing acceptable read performance; based on the second, we propose an additional offline deduplication scheme that achieves deduplication ratios comparable to those of classical deduplication. We design a cold storage system called DeCold that combines these two deduplication schemes and improves deduplication efficiency. We prototype DeCold and conduct testbed experiments on real-world datasets including source code, virtual machines, and databases. Evaluations show that DeCold achieves better file access performance than the classical deduplication implementation while maintaining decent deduplication efficiency.
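As a rough illustration of how a deduplication ratio can be tracked across successive versions, the sketch below computes a cumulative ratio after each version is ingested. The definition used here (logical bytes written divided by unique bytes stored), the fixed chunk size, and the function name are assumptions; the abstract does not state the paper's exact metric, and the toy data is not meant to reproduce the observed curve.

```python
import hashlib


def cumulative_dedup_ratios(versions, chunk_size=4):
    """Return the cumulative deduplication ratio after each version.

    Ratio is taken as (logical bytes written) / (unique bytes stored),
    which is an assumed definition for this sketch.
    """
    seen = set()   # fingerprints of chunks already stored
    logical = 0    # bytes the client has written so far
    stored = 0     # bytes actually kept after deduplication
    ratios = []
    for data in versions:
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            logical += len(chunk)
            fp = hashlib.sha256(chunk).digest()
            if fp not in seen:
                seen.add(fp)
                stored += len(chunk)
        ratios.append(logical / stored)
    return ratios


# Three hypothetical versions of the same workload.
versions = [b"AAAABBBBCCCC", b"AAAABBBBDDDD", b"AAAABBBBDDDDEEEE"]
ratios = cumulative_dedup_ratios(versions)
print(ratios)  # [1.0, 1.5, 2.0]
```

Plotting such a series for a real workload, version by version, is one way to see whether the ratio rises quickly at first and then flattens into a long tail.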

Published in

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN: 9781450390682
DOI: 10.1145/3472456

Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Cold storage
Deduplication
Right-provisioning
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 91 of 313 submissions, 29%