Abstract
In-memory systems with erasure coding (EC) enabled are widely used to achieve high performance and data availability. However, as clusters grow, server-level fail-slow problems become increasingly frequent and can create long tail latency. The impact of long tail latency is further amplified in EC-based systems, because a single EC operation must wait for multiple sub-operations to complete synchronously. In this paper, we propose an EC-enabled in-memory storage system called Short Tail, which achieves consistent performance and low latency for both reads and writes. First, Short Tail uses a lightweight request monitor to track the performance of each memory node and identify any fail-slow node. Second, Short Tail selectively performs degraded reads and redirected writes to avoid accessing fail-slow nodes. Finally, Short Tail adopts an adaptive write strategy to reduce the write amplification of small writes. We implement Short Tail on top of Memcached and compare it with two baseline systems. The experimental results show that Short Tail reduces the P99 tail latency by up to 63.77%; it also significantly improves the median and average latency.
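Abstract (translated from the Chinese)
Erasure-coded in-memory storage systems are widely used to obtain high performance and high data availability. However, as cluster scale keeps growing, server-level performance degradation (fail-slow) occurs more and more frequently and leads to long tail latency. In erasure-coded systems, the impact of long tail latency is further amplified, because one erasure-coding operation may depend on the synchronous completion of multiple sub-operations. This paper proposes an erasure-coded in-memory storage system called ShortTail, which delivers stable performance and low read and write latency. First, ShortTail uses a lightweight request monitor to track the performance of each memory node so that degraded nodes are detected promptly. Second, ShortTail selectively performs degraded reads and redirected writes to avoid accessing degraded nodes. Finally, ShortTail adopts an adaptive write strategy to reduce the write amplification of small write requests. ShortTail is implemented on Memcached and compared with two other systems. The experimental results show that ShortTail reduces the 99th-percentile latency by up to 63.77% and significantly improves the median and average latency.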
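The first two mechanisms in the abstract (per-node monitoring, then steering reads away from fail-slow nodes) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the detection rule, so the exponentially weighted moving average, the `slow_factor` threshold relative to the cluster median, and the names `RequestMonitor` and `plan_read` are all assumptions made for illustration only.

```python
from statistics import median


class RequestMonitor:
    """Hypothetical monitor: keeps a smoothed latency per node."""

    def __init__(self, alpha=0.2, slow_factor=3.0):
        self.alpha = alpha              # EWMA weight (assumed value)
        self.slow_factor = slow_factor  # fail-slow threshold multiple (assumed)
        self.latency = {}               # node id -> smoothed latency

    def record(self, node, sample):
        """Fold one observed request latency into the node's average."""
        prev = self.latency.get(node)
        if prev is None:
            self.latency[node] = sample
        else:
            self.latency[node] = (1 - self.alpha) * prev + self.alpha * sample

    def fail_slow_nodes(self):
        """Nodes whose smoothed latency far exceeds the cluster median."""
        if not self.latency:
            return set()
        baseline = median(self.latency.values())
        return {n for n, lat in self.latency.items()
                if lat > self.slow_factor * baseline}


def plan_read(data_nodes, parity_nodes, slow):
    """Choose k nodes to read from for one stripe.

    Returns (nodes, degraded). A degraded read substitutes healthy
    parity chunks for data chunks held by fail-slow nodes, so the
    client must decode. If too few healthy chunks exist, fall back
    to a normal read of all data nodes.
    """
    k = len(data_nodes)
    healthy_data = [n for n in data_nodes if n not in slow]
    if len(healthy_data) == k:
        return healthy_data, False
    healthy_parity = [n for n in parity_nodes if n not in slow]
    chosen = (healthy_data + healthy_parity)[:k]
    if len(chosen) < k:
        return list(data_nodes), False
    return chosen, True
```

With four data nodes and two parity nodes, a single slow data node causes the planner to read three data chunks plus one parity chunk instead of waiting on the slow node. Comparing against the median rather than the mean keeps one very slow node from inflating its own threshold.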
Author information
Contributions
Yun TENG designed the research and implemented the prototype system. Yun TENG and Zhiyue LI drafted the paper. Jing HUANG and Guangyan ZHANG helped organize the paper. Yun TENG and Guangyan ZHANG revised and finalized the paper.
Additional information
Compliance with ethics guidelines
Yun TENG, Zhiyue LI, Jing HUANG, and Guangyan ZHANG declare that they have no conflict of interest.
Project supported by the National Natural Science Foundation of China (No. 62025203) and the Changchun Key Scientific and Technological Research and Development Project, China (No. 21ZGN30).
About this article
Cite this article
Teng, Y., Li, Z., Huang, J. et al. Short Tail: taming tail latency for erasure-code-based in-memory systems. Front Inform Technol Electron Eng 23, 1646–1657 (2022). https://doi.org/10.1631/FITEE.2100566
DOI: https://doi.org/10.1631/FITEE.2100566