DOI: 10.1145/2038916.2038934
research-article

No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics

Published: 26 October 2011

ABSTRACT

Infrastructure-as-a-Service (IaaS) cloud platforms have brought two unprecedented changes to cluster provisioning practices. First, any (nonexpert) user can provision a cluster of any size on the cloud within minutes to run her data-processing jobs. The user can terminate the cluster once her jobs complete, and she needs to pay only for the resources used and duration of use. Second, cloud platforms enable users to bypass the traditional middleman (the system administrator) in the cluster-provisioning process. These changes give tremendous power to the user, but place a major burden on her shoulders. The user is now faced regularly with complex cluster sizing problems that involve finding the cluster size, the type of resources to use in the cluster from the large number of choices offered by current IaaS cloud platforms, and the job configurations that best meet the performance needs of her workload.

In this paper, we introduce the Elastisizer, a system to which users can express cluster sizing problems as queries in a declarative fashion. The Elastisizer provides reliable answers to these queries using an automated technique that uses a mix of job profiling, estimation using black-box and white-box models, and simulation. We have prototyped the Elastisizer for the Hadoop MapReduce framework, and present a comprehensive evaluation that shows the benefits of the Elastisizer in common scenarios where cluster sizing problems arise.
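
To make the shape of a cluster sizing problem concrete, the following is a minimal sketch, not the Elastisizer's actual technique or interface. It assumes a hypothetical runtime estimator (standing in for the profiling, modeling, and simulation mentioned above) and simply enumerates candidate clusters to find the cheapest one that meets a deadline. The instance labels, prices, speeds, and the function names `toy_estimate_hours` and `cheapest_within_deadline` are all illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ClusterChoice:
    instance_type: str  # hypothetical label, not a real cloud offering
    num_nodes: int


def cheapest_within_deadline(
    instance_types: Iterable[str],
    node_counts: Iterable[int],
    hourly_price: Dict[str, float],
    estimate_hours: Callable[[str, int], float],
    deadline_hours: float,
) -> Tuple[Optional[ClusterChoice], float]:
    """Exhaustively search (type, size) pairs; return the cheapest feasible one.

    Returns (None, inf) when no candidate meets the deadline.
    """
    best: Optional[ClusterChoice] = None
    best_cost = float("inf")
    for itype, n in product(list(instance_types), list(node_counts)):
        hours = estimate_hours(itype, n)
        if hours > deadline_hours:
            continue  # violates the performance requirement
        cost = hours * n * hourly_price[itype]  # pay per node per hour used
        if cost < best_cost:
            best, best_cost = ClusterChoice(itype, n), cost
    return best, best_cost


# Made-up numbers purely for illustration.
TOY_PRICE = {"small": 1.0, "large": 4.0}   # price per node-hour
TOY_SPEED = {"small": 1.0, "large": 3.0}   # relative per-node throughput
TOY_WORK = 120.0                           # work in "small"-node hours
TOY_STARTUP = 0.5                          # fixed overhead in hours


def toy_estimate_hours(instance_type: str, num_nodes: int) -> float:
    """Assumed estimator: perfectly parallel work plus a fixed startup cost."""
    return TOY_WORK / (num_nodes * TOY_SPEED[instance_type]) + TOY_STARTUP


if __name__ == "__main__":
    choice, cost = cheapest_within_deadline(
        ["small", "large"], [10, 20, 40], TOY_PRICE, toy_estimate_hours, 5.0
    )
    print(choice, cost)
```

A real system would replace the toy estimator with predictions derived from job profiles, and would also search over job configuration settings rather than only instance type and node count.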


          Published in

          SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing
          October 2011
          377 pages
          ISBN: 9781450309769
          DOI: 10.1145/2038916

          Copyright © 2011 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 169 of 722 submissions, 23%
