Abstract
The exponential growth of the amount of data stored worldwide, together with a high level of data redundancy, motivates the active development of data deduplication techniques. The increasing popularity of solid-state drives (SSDs) as primary storage devices requires adapting these techniques to the technical peculiarities of this type of storage, such as write amplification and wear-out, and has prompted active research on deduplication for SSD-equipped storage. In this survey paper, the authors summarize recent results on deduplication in SSD-enhanced storage and provide a novel taxonomy of the techniques. They classify the techniques by storage device complexity, from the sub-device level up to the storage network. Linux deduplication implementations are discussed, and the results of an experimental comparison of several widely used tools are presented. Finally, the authors briefly outline open problems in the field and possible directions for future research.