ABSTRACT
The rapidly growing size of data and complexity of analytics present new challenges for large-scale data processing systems. Modern systems keep data partitions in memory for pipelined operators, and persist data across stages with wide dependencies on disk for fault tolerance. While processing can often scale well by splitting jobs into smaller tasks for better parallelism, the all-to-all data transfers---called shuffle operations---become the scaling bottleneck when running many small tasks in multi-stage data analytics jobs. Our key observation is that this bottleneck stems from the superlinear increase in disk I/O operations as data volume grows.
We present Riffle, an optimized shuffle service for big-data analytics frameworks that significantly improves I/O efficiency and scales to process petabytes of data. To do so, Riffle efficiently merges fragmented intermediate shuffle files into larger block files, and thus converts small, random disk I/O requests into large, sequential ones. Riffle further improves performance and fault tolerance by mixing both merged and unmerged block files to minimize merge operation overhead. Using Riffle, Facebook production jobs on Spark clusters with over 1,000 executors experience up to a 10x reduction in the number of shuffle I/O requests and a 40% improvement in end-to-end job completion time.
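The superlinear growth in I/O requests follows from the all-to-all shuffle pattern: with M map tasks and R reduce tasks, reducers collectively issue on the order of M x R fetch requests, so doubling both task counts quadruples the request count. The back-of-the-envelope sketch below illustrates this and the effect of map-side merging; the function name and merge-factor parameter are ours for illustration, not Riffle's actual API.

```python
import math

def fetch_requests(num_map_tasks, num_reduce_tasks, merge_factor=1):
    """Count reduce-side fetch requests for an all-to-all shuffle.

    Each reducer issues one fetch per (merged) map-side output file.
    A merge factor of N means N map outputs are combined into one
    block file, cutting requests from M*R to ceil(M/N)*R.
    """
    merged_files = math.ceil(num_map_tasks / merge_factor)
    return merged_files * num_reduce_tasks

# Without merging, request count grows superlinearly with task count:
assert fetch_requests(1000, 1000) == 1_000_000
assert fetch_requests(2000, 2000) == 4_000_000   # 2x tasks -> 4x requests

# Merging 10 map outputs per block yields a 10x reduction in requests,
# consistent with the order of improvement reported in the abstract:
assert fetch_requests(1000, 1000, merge_factor=10) == 100_000
```

Since each avoided request is a small, random disk read replaced by part of a large, sequential one, the request-count reduction translates directly into fewer disk seeks.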