Live Recovery of Bit Corruptions in Datacenter Storage Systems.

To do so, we present DIRECT, a novel set of policies that leverages latent redundancy in distributed storage systems to recover from bit corruption errors with minimal performance and recovery overhead ... Due to its high performance and decreasing cost per bit, flash is becoming the main storage medium in datacenters for hot data. ... We also show how increasing the resiliency to bit-level errors can significantly reduce storage costs and improve live recovery speed in datacenter environments. ...

arXiv:1805.02790v2 fatcat:htzwylssjnemzlkpc7jx5m67ju

Due to its high performance and decreasing cost per bit, flash storage is the main storage medium in datacenters for hot data. ... By significantly increasing the availability of distributed storage systems in the face of bit errors, DIRECT helps extend flash lifetimes. ... We also show how increasing the resiliency to bit errors can significantly reduce storage costs and improve live recovery speed in datacenter environments. ...

dblp:conf/usenix/TaiKKJFC19 fatcat:4jjnw5fafzc25cz5gasrfziufe

To assure stability in storage costs, with best practice to assure recovery from content failure or loss, we applied a Maximum Distance Separable (MDS) code such as Reed-Solomon to our RMCSA. ... The increasing popularity of cloud storage services has led companies that handle critical content to think about using these services for their daily storage needs. ... The authors express their appreciation to Nanjing University of Science and Technology for creating a research-fostering environment. ...

doi:10.23953/cloud.ijaccar.260 fatcat:cwifw7w2j5ch3p5nlx5mxms3da

Multicore processing, virtualization, distributed storage systems and an overarching management framework that enable a Cloud offer a plethora of possibilities to provide high availability using commodity ... With its geographical spread and value proposition comes the need to provide a guaranteed level of availability in the infrastructure and in its services. ... ACKNOWLEDGMENT This work was supported in part by NSF grant CNS 10-18503 CISE, the Department of Energy under Award Number DE-OE0000097, the Air Force Office of Scientific Research, under agreement number ...

doi:10.1109/dsnw.2012.6264687 dblp:conf/dsn/PhamCKI12 fatcat:chins7svy5h2rihrlnqcsyhc7e

In this paper, we present PaxosStore, a high-availability storage system developed to support the comprehensive business of WeChat. ... It employs a combinational design in the storage layer to engage multiple storage engines constructed for different storage models. ... Acknowledgments We would like to thank the anonymous reviewers and shepherd of the paper for their insightful feedback that helped improve the paper. ...

doi:10.14778/3137765.3137778 fatcat:rydvf7xt7rb5vmpkjzijw4tfui

Now, a decade later, we revisit RoCE's design points and conclude that several of its shortcomings must be addressed to fulfill the demands of hyperscale datacenters. ... We observe that emerging artificial intelligence, high-performance computing, and storage workloads pose new challenges for large-scale datacenter networking. ... This implies that packets can only be dropped if they are corrupted by bit errors, a very rare event. ...

arXiv:2302.03337v1 fatcat:kra475y64bcp3jb54frgd5j5ra

Occasional corruption of stored data is an unfortunate byproduct of the complexity of modern systems. ... The dominant practice to deal with data corruption today involves administrators writing ad hoc scripts that run data-integrity tests at the application, database, file-system, and storage levels. ... INTRODUCTION Data corruption, where bits of data in persistent storage differ from what they are supposed to be, is an ugly reality that database and storage administrators have to deal with occasionally ...

dblp:journals/pvldb/BorisovB11 fatcat:t4dwjixfjvb2tfn4e4gqx5pzgu

In a large datacenter with 100,000 nodes, we expect small reads to complete in less than 10μs, which is 50 to 1,000 times faster than the storage systems commonly used today. ... Over the past 15 years, the use of DRAM in storage systems has accelerated, driven by the needs of large-scale Web applications. ... Coordinator crashes; corruption of segments, either in DRAM or on secondary storage. Multiple failures can occur simultaneously. ...

doi:10.1145/2806887 fatcat:fg3r5yahbjhxhcor6m2w2q6bxy

We measure recovery of 92 GB data lost to disk failure in 6.2 s and recovery from a total machine failure with 655 GB of data in 33.7 s. ... Flat Datacenter Storage (FDS) is a high-performance, fault-tolerant, large-scale, locality-oblivious blob store. ... Johnson Apacible, Rich Draves, and Reuben Olinsky were part of the sort record team. Trevor Eberl, Jamie Lee, Oleg Losinets and Lucas Williamson provided systems support. ...

dblp:conf/osdi/NightingaleEFHHS12 fatcat:5ulzufamjnhnhblg53d6ibwtiq

We report the design, implementation, and deployment of Lepton, a fault-tolerant system that losslessly compresses JPEG images to 77% of their original size on average. ... Lepton matches the compression efficiency of the best prior work, while decoding more than nine times faster and in a streaming manner. ... KW's participation was as a paid consultant and was not part of his Stanford duties or responsibilities. ...

arXiv:1704.06192v1 fatcat:6n2sefbsnba3fbi4kmovpeyj54

Acknowledgments While we draw from our direct involvement in Google's infrastructure design and operation over the past several years, most of what we have learned and now report here is the result of ... Thanks in advance for taking the time to contribute. ... The Autopilot system from Microsoft [87] offers an example design for some of this functionality for Windows Live datacenters. ...

doi:10.2200/s00516ed2v01y201306cac024 fatcat:435o455inbcmrakl6l7jp4gope

In this paper, we propose a comprehensive taxonomy of sustainable cloud computing. ... The use of a large number of cloud datacenters increases cost as well as carbon footprint, which further affects the sustainability of cloud services. ... datacenter powering systems and local flash storage with low power CPUs. ...

arXiv:1712.02899v2 fatcat:t26xxbgiijesneqgzi2mqz4gta

With the unprecedented pace of development in cloud computing technology, there has been an exponential increase in users of these services and an equal rise in cloud service providers. ... Cloud computing is a virtual pool of resources provided to users as a service through a web interface. These resources may include Software, Infrastructure, Storage, Network, Platform, etc. ... Data Services (Storage, SQL Database, HDInsight, Cache, Backup, Recovery Manager) III. ...

doi:10.5120/17149-7184 fatcat:262geqqsqbet7gofec7wyrn2di

We report on experiences with Swift congestion control in Google datacenters. Swift targets an end-to-end delay by using AIMD control, with pacing under extreme congestion. ... In large-scale testbed experiments, Swift delivers a tail latency of <50µs for short RPCs, with near-zero packet drops, while sustaining ∼100Gbps throughput per server. ... Manya Ghobadi, Emily Blem, Vinh The Lam, Philip Wells and Ashish Naik contributed to the work in the early days of Swift. ...

doi:10.1145/3387514.3406591 dblp:conf/sigcomm/KumarDJWWMWSARW20 fatcat:ks4dtoy7hfdo5nerw6pcfmhcky

Citation

Gautam Kumar, Nandita Dukkipati, Keon Jang, Hassan M. G. Wassel, Xian Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn, Christopher Alfeld, Michael Ryan, David Wetherall, Amin Vahdat. "Swift." Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication (2020) 514-528

In a virtualized datacenter, the Service Level Agreement for an application restricts the Virtual Machines (VMs) placement. ... We propose a Byzantine fault tolerant pub/sub system, on a tree-based overlay, tolerating a configurable number of failures in any part of the system, with minimal divergence from traditional pub/sub specifications ... We thank the developers and users of SQLite and LevelDB for helping us understand their software in detail. ...

doi:10.1145/2524224.2524226 dblp:conf/hotdep/DangH13 fatcat:xe5soaxhengtxcufr5oahrxyoq

Live Recovery of Bit Corruptions in Datacenter Storage Systems [article]

Who's Afraid of Uncorrectable Bit Errors? Online Recovery of Flash Errors with Distributed Redundancy

Reliable Multi-cloud Storage Architecture Based on Erasure Code to Improve Storage Performance and Failure Recovery

Toward a high availability cloud: Techniques and challenges

PaxosStore

Datacenter Ethernet and RDMA: Issues at Hyperscale [article]

Proactive Detection and Repair of Data Corruption: Towards a Hassle-free Declarative Approach with Amulet

The RAMCloud Storage System

Flat Datacenter Storage

The Design, Implementation, and Deployment of a System to Transparently Compress Hundreds of Petabytes of Image Files for a File-Storage Service

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second edition

A Taxonomy and Future Directions for Sustainable Cloud Computing: 360 Degree View [article]

A Survey on Security Mechanisms of Leading Cloud Service Providers

Swift

Higher SLA satisfaction in datacenters with continuous VM placement constraints