ABSTRACT
Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50% of job completion times. Despite this impact, there has been relatively little work on optimizing the performance of these data transfers, with networking researchers traditionally focusing on per-flow traffic management. We address this limitation by proposing a global management architecture and a set of algorithms that (1) improve the transfer times of common communication patterns, such as broadcast and shuffle, and (2) allow scheduling policies at the transfer level, such as prioritizing a transfer over other transfers. Using a prototype implementation, we show that our solution improves broadcast completion times by up to 4.5X compared to the status quo in Hadoop. We also show that transfer-level scheduling can reduce the completion time of high-priority transfers by 1.7X.
Supplemental Material
- Amazon EC2. http://aws.amazon.com/ec2.Google Scholar
- Apache Hadoop. http://hadoop.apache.org.Google Scholar
- BitTornado. http://www.bittornado.com.Google Scholar
- BitTorrent. http://www.bittorrent.com.Google Scholar
- DETERlab. http://www.isi.deterlab.net.Google Scholar
- Fragment replicate join -- Pig wiki. http://wiki.apache.org/pig/PigFRJoin.Google Scholar
- LANTorrent. http://www.nimbusproject.org.Google Scholar
- Murder. http://github.com/lg/murder.Google Scholar
- H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. Symbiotic routing in future data centers. In SIGCOMM, pages 51--62, 2010. Google ScholarDigital Library
- M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In NSDI, 2010. Google ScholarDigital Library
- G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in mapreduce clusters using Mantri. In OSDI, 2010. Google ScholarDigital Library
- M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. McKeown, and S. Shenker. Ethane: Taking control of the enterprise. In SIGCOMM, pages 1--12, 2007. Google ScholarDigital Library
- M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron, and A. Singh. Splitstream: high-bandwidth multicast in cooperative environments. In SOSP, 2003. Google ScholarDigital Library
- Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph. Understanding TCP incast throughput collapse in datacenter networks. In WREN, pages 73--82, 2009. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarDigital Library
- C. Diot, W. Dabbous, and J. Crowcroft. Multipoint communication: A survey of protocols, functions, and mechanisms. IEEE JSAC, 15(3):277--290, 1997. Google ScholarDigital Library
- B. Donnet, B. Gueye, and M. A. Kaafar. A Survey on Network Coordinates Systems, Design, and Security. IEEE Communication Surveys and Tutorials, 12(4), Oct. 2010. Google ScholarDigital Library
- C. Fraley and A. Raftery. MCLUST Version 3 for R: Normal mixture modeling and model-based clustering. Technical Report 504, Department of Statistics, University of Washington, Sept. 2006.Google Scholar
- P. Ganesan and M. Seshadri. On cooperative content distribution and the price of barter. In ICDCS, 2005. Google ScholarDigital Library
- C. Gkantsidis, T. Karagiannis, and M. VojnoviC. Planet scale software updates. In SIGCOMM, pages 423--434, 2006. Google ScholarDigital Library
- A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. In SIGCOMM, 2009. Google ScholarDigital Library
- A. Greenberg, G. Hjalmtysson, D. A. Maltz, A. Myers, J. Rexford, G. Xie, H. Yan, J. Zhan, and H. Zhang. A clean slate 4D approach to network control and management. SIGCOMM CCR, 35:41--54, 2005. Google ScholarDigital Library
- C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: A high performance, server-centric network architecture for modular data centers. In SIGCOMM, pages 63--74, 2009. Google ScholarDigital Library
- C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: A scalable and fault-tolerant network structure for data centers. In SIGCOMM, pages 75--86, 2008. Google ScholarDigital Library
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, NY, 2009.Google ScholarCross Ref
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In NSDI, 2011. Google ScholarDigital Library
- U. Hoelzle and L. A. Barroso. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, 1st edition, 2009. Google ScholarDigital Library
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, pages 59--72, 2007. Google ScholarDigital Library
- M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In SOSP, 2009. Google ScholarDigital Library
- D. A. Joseph, A. Tavakoli, and I. Stoica. A policy-aware switching layer for data centers. In SIGCOMM, 2008. Google ScholarDigital Library
- J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage University Paper series on Quantitative Applications in the Social Sciences, 07-001, 1978.Google ScholarCross Ref
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, 2010. Google ScholarDigital Library
- Y. Mao and L. K. Saul. Modeling Distances in Large-Scale Networks by Matrix Factorization. In IMC, 2004. Google ScholarDigital Library
- D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand. Ciel: A Universal Execution Engine for Distributed Data-Flow Computing. In NSDI, 2011. Google ScholarDigital Library
- R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: A scalable fault-tolerant layer 2 data center network fabric. In SIGCOMM, pages 39--50, 2009. Google ScholarDigital Library
- R. Peterson and E. G. Sirer. Antfarm: Efficient content distribution with managed swarms. In NSDI, 2009. Google ScholarDigital Library
- B. Pfaff, J. Pettit, K. Amidon, M. Casado, T. Koponen, and S. Shenker. Extending networking into the virtualization layer. In HotNets 2009.Google Scholar
- A. Shieh, S. Kandula, A. Greenberg, and C. Kim. Sharing the data center network. In NSDI, 2011. Google ScholarDigital Library
- D. B. Shmoys. Cut problems and their application to divide-and-conquer, chapter 5, pages 192--235. PWS Publishing Co., Boston, MA, USA, 1997. Google ScholarDigital Library
- K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Design and evaluation of a real-time URL spam filtering service. In IEEE Symposium on Security and Privacy, 2011. Google ScholarDigital Library
- V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller. Safe and effective fine-grained TCP retransmissions for datacenter communication. In SIGCOMM, pages 303--314, 2009. Google ScholarDigital Library
- H. Yan, D. A. Maltz, T. S. E. Ng, H. Gogineni, H. Zhang, and Z. Cai. Tesseract: A 4D network control plane. In NSDI '07. Google ScholarDigital Library
- M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, 2010. Google ScholarDigital Library
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In HotCloud, 2010. Google ScholarDigital Library
- Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In AAIM, pages 337--348. Springer-Verlag, 2008. Google ScholarDigital Library
Index Terms
- Managing data transfers in computer clusters with orchestra
Recommendations
Efficient Coflow Scheduling Without Prior Knowledge
SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data CommunicationInter-coflow scheduling improves application-level communication performance in data-parallel clusters. However, existing efficient schedulers require a priori coflow information and ignore cluster dynamics like pipelining, task failures, and ...
Efficient coflow scheduling with Varys
SIGCOMM '14: Proceedings of the 2014 ACM conference on SIGCOMMCommunication in data-parallel applications often involves a collection of parallel flows. Traditional techniques to optimize flow-level metrics do not perform well in optimizing such collections, because the network is largely agnostic to application-...
Managing data transfers in computer clusters with orchestra
SIGCOMM '11Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50% of job completion times. Despite ...
Comments