No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics

Authors:
Herodotos Herodotou, Duke University
Fei Dong, Duke University
Shivnath Babu, Duke University

SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing, October 2011, Article No. 18, Pages 1–14. https://doi.org/10.1145/2038916.2038934

Published: 26 October 2011

ABSTRACT

Infrastructure-as-a-Service (IaaS) cloud platforms have brought two unprecedented changes to cluster provisioning practices. First, any (nonexpert) user can provision a cluster of any size on the cloud within minutes to run her data-processing jobs. The user can terminate the cluster once her jobs complete, and she pays only for the resources used and the duration of use. Second, cloud platforms enable users to bypass the traditional middleman, the system administrator, in the cluster-provisioning process. These changes give tremendous power to the user, but place a major burden on her shoulders. The user now regularly faces complex cluster sizing problems that involve finding the cluster size, the type of resources to use in the cluster from the large number of choices offered by current IaaS cloud platforms, and the job configurations that best meet the performance needs of her workload.

In this paper, we introduce the Elastisizer, a system to which users can express cluster sizing problems as queries in a declarative fashion. The Elastisizer provides reliable answers to these queries using an automated technique that combines job profiling, estimation using black-box and white-box models, and simulation. We have prototyped the Elastisizer for the Hadoop MapReduce framework, and present a comprehensive evaluation that shows the benefits of the Elastisizer in common scenarios where cluster sizing problems arise.
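To make the notion of a cluster sizing query concrete, here is a minimal sketch of one such question: which cluster size and resource type is cheapest while still finishing within a deadline? Everything in it is an assumption made for illustration. The names (`Candidate`, `toy_runtime_hours`, `cheapest_within_deadline`), the instance types, the prices, and the runtime formula are hypothetical and are not taken from the paper. The toy estimator merely stands in for the profiling, model-based estimation, and simulation that the abstract describes, and job configurations are left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """One point in the search space: a resource type and a cluster size."""
    instance_type: str
    num_nodes: int


# Assumed, made-up prices in dollars per node-hour (not real cloud prices).
PRICE_PER_NODE_HOUR = {"small": 0.10, "large": 0.40}


def toy_runtime_hours(c: Candidate) -> float:
    """Hypothetical stand-in for a what-if runtime estimate.

    This is NOT the Elastisizer's model; it is a made-up formula so the
    search below has something to call.
    """
    speed = {"small": 1.0, "large": 3.0}[c.instance_type]  # assumed relative speed
    work = 120.0                    # assumed total work, in small-node hours
    overhead = 0.05 * c.num_nodes   # assumed overhead that grows with cluster size
    return work / (speed * c.num_nodes) + overhead


def cheapest_within_deadline(
    candidates: Iterable[Candidate],
    estimate: Callable[[Candidate], float],
    deadline_hours: float,
) -> Optional[Tuple[Candidate, float, float]]:
    """Return (candidate, dollars, hours) for the cheapest feasible candidate.

    Cost follows the pay-per-use idea: nodes * hours * price per node-hour.
    Returns None when no candidate meets the deadline.
    """
    best: Optional[Tuple[Candidate, float, float]] = None
    for c in candidates:
        hours = estimate(c)
        if hours > deadline_hours:
            continue  # violates the performance requirement
        dollars = c.num_nodes * hours * PRICE_PER_NODE_HOUR[c.instance_type]
        if best is None or dollars < best[1]:
            best = (c, dollars, hours)
    return best


if __name__ == "__main__":
    space = [Candidate(t, n) for t in ("small", "large") for n in (5, 10, 20, 40)]
    print(cheapest_within_deadline(space, toy_runtime_hours, deadline_hours=4.0))
```

The exhaustive loop is workable only because this toy space has eight candidates. A real search space that also includes job configuration parameters would be far larger and would presumably need a smarter search strategy.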

Index Terms

1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability
      1. Distributed retrieval
      2. Peer-to-peer retrieval
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage

Published in
SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing
October 2011
377 pages
ISBN: 9781450309769
DOI: 10.1145/2038916
Program Chairs: Jeffrey S. Chase (Duke University), Amr El Abbadi (Univ of California, Santa Barbara)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
MapReduce
cloud computing
cluster provisioning
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 169 of 722 submissions, 23%