DOI: 10.1145/3190508.3190534
research-article
Open Access

Riffle: optimized shuffle service for large-scale data analytics

Published: 23 April 2018

ABSTRACT

The rapidly growing size of data and complexity of analytics present new challenges for large-scale data processing systems. Modern systems keep data partitions in memory for pipelined operators, and persist data on disk across stages with wide dependencies for fault tolerance. While processing can often scale well by splitting jobs into smaller tasks for better parallelism, all-to-all data transfers, called shuffle operations, become the scaling bottleneck when running many small tasks in multi-stage data analytics jobs. Our key observation is that this bottleneck is due to the superlinear increase in disk I/O operations as data volume increases.
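
One way to see why the number of I/O operations could grow superlinearly is a simplified counting model. The sketch below is an illustration under stated assumptions, not a model taken from the abstract: it assumes every reduce task fetches exactly one block from every map task, and that per-task input sizes stay fixed as the data grows. The function name and parameters are hypothetical.

```python
import math

GIB = 1 << 30  # bytes in one GiB


def shuffle_read_requests(data_bytes: int, map_task_bytes: int, reduce_task_bytes: int) -> int:
    """Count shuffle fetches in an assumed all-to-all model.

    Assumption: each reduce task reads one block from each map task,
    so the number of fetches is (number of mappers) * (number of reducers).
    """
    num_mappers = math.ceil(data_bytes / map_task_bytes)
    num_reducers = math.ceil(data_bytes / reduce_task_bytes)
    return num_mappers * num_reducers


# With fixed 1 GiB tasks, 10x more data gives 10x more mappers and
# 10x more reducers, hence 100x more fetch requests in this model.
small = shuffle_read_requests(100 * GIB, GIB, GIB)
large = shuffle_read_requests(1000 * GIB, GIB, GIB)
print(small, large, large // small)
```

Under these assumptions the request count is quadratic in the data volume while the bytes moved grow only linearly, so the average request shrinks as the job grows.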

We present Riffle, an optimized shuffle service for big-data analytics frameworks that significantly improves I/O efficiency and scales to process petabytes of data. To do so, Riffle efficiently merges fragmented intermediate shuffle files into larger block files, and thus converts small, random disk I/O requests into large, sequential ones. Riffle further improves performance and fault tolerance by mixing both merged and unmerged block files to minimize merge operation overhead. Using Riffle, Facebook production jobs on Spark clusters with over 1,000 executors experience up to a 10x reduction in the number of shuffle I/O requests and 40% improvement in the end-to-end job completion time.
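
The merging idea can be sketched in a few lines. This is a minimal in-memory illustration, not Riffle's implementation: the data layout (one dictionary of partition id to bytes per map output), the `fan_in` parameter, and the policy of leaving a trailing incomplete group unmerged are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical layout: one map output maps reduce-partition id -> block bytes.
MapOutput = Dict[int, bytes]


def merge_map_outputs(outputs: List[MapOutput], fan_in: int) -> List[MapOutput]:
    """Merge every full group of `fan_in` map outputs into one larger file.

    Blocks that belong to the same reduce partition are concatenated, so a
    reducer can fetch them with one large read instead of `fan_in` small ones.
    Outputs that do not fill a complete group are passed through unmerged
    (an assumed policy, used here to show merged and unmerged files mixing).
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    full = len(outputs) - len(outputs) % fan_in
    result: List[MapOutput] = []
    for start in range(0, full, fan_in):
        merged: Dict[int, bytearray] = {}
        for out in outputs[start:start + fan_in]:
            for pid, block in out.items():
                merged.setdefault(pid, bytearray()).extend(block)
        result.append({pid: bytes(buf) for pid, buf in merged.items()})
    result.extend(outputs[full:])  # leftover files stay unmerged
    return result


def fetch_requests(files: List[MapOutput], num_reducers: int) -> int:
    """Count one read request per (file, reduce partition) pair that has data."""
    return sum(1 for f in files for pid in range(num_reducers) if pid in f)


# 25 map outputs, 4 reduce partitions, merge in groups of 10.
outputs = [{p: bytes([m, p]) for p in range(4)} for m in range(25)]
merged = merge_map_outputs(outputs, fan_in=10)
print(fetch_requests(outputs, 4), fetch_requests(merged, 4))
```

In this toy run, two merged files plus five unmerged ones replace the original 25, and each reducer still receives the same bytes in the same order, only through fewer and larger reads.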

Published in

EuroSys '18: Proceedings of the Thirteenth EuroSys Conference
April 2018
631 pages
ISBN: 9781450355841
DOI: 10.1145/3190508

Copyright © 2018 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

EuroSys '18 Paper Acceptance Rate: 43 of 262 submissions, 16%
Overall Acceptance Rate: 241 of 1,308 submissions, 18%