ABSTRACT
Infrastructure-as-a-Service (IaaS) cloud platforms have brought two unprecedented changes to cluster provisioning practices. First, any (nonexpert) user can provision a cluster of any size on the cloud within minutes to run her data-processing jobs. The user can terminate the cluster once her jobs complete, and she pays only for the resources used and the duration of use. Second, cloud platforms enable users to bypass the traditional middleman, the system administrator, in the cluster-provisioning process. These changes give tremendous power to the user, but place a major burden on her shoulders. The user now regularly faces complex cluster sizing problems that involve finding the cluster size, the type of resources to use in the cluster from the large number of choices offered by current IaaS cloud platforms, and the job configurations that best meet the performance needs of her workload.
In this paper, we introduce the Elastisizer, a system to which users can express cluster sizing problems as declarative queries. The Elastisizer provides reliable answers to these queries using an automated technique that combines job profiling, estimation using black-box and white-box models, and simulation. We have prototyped the Elastisizer for the Hadoop MapReduce framework, and present a comprehensive evaluation that shows the benefits of the Elastisizer in common scenarios where cluster sizing problems arise.
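To make the notion of a cluster sizing query concrete, the following is a minimal Python sketch of one such question: which node type and cluster size finish a workload within a deadline at the lowest cost? It is not the Elastisizer's interface or algorithm. The names `NodeType`, `estimate_runtime_hours`, and `cheapest_cluster`, the prices, the speed factors, and the linear-scaling runtime model are all hypothetical stand-ins for the profiling, modeling, and simulation the abstract describes.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class NodeType:
    name: str
    price_per_hour: float  # hypothetical price, not a real cloud quote
    relative_speed: float  # hypothetical throughput relative to a baseline node


def estimate_runtime_hours(work: float, node: NodeType, num_nodes: int) -> float:
    """Toy performance model: assumes perfect linear scaling (an assumption)."""
    return work / (node.relative_speed * num_nodes)


def cheapest_cluster(
    work: float,
    node_types: Iterable[NodeType],
    sizes: Iterable[int],
    deadline_hours: float,
) -> Optional[Tuple[float, str, int, float]]:
    """Exhaustively search (node type, size) pairs that meet the deadline.

    Returns (cost, node name, cluster size, runtime) or None if no pair fits.
    """
    best = None
    for node, n in product(list(node_types), list(sizes)):
        runtime = estimate_runtime_hours(work, node, n)
        if runtime > deadline_hours:
            continue  # violates the performance requirement
        cost = runtime * n * node.price_per_hour
        if best is None or cost < best[0]:
            best = (cost, node.name, n, runtime)
    return best


# Hypothetical catalog and workload (120 baseline node-hours of work).
NODE_TYPES = [NodeType("small", 0.10, 1.0), NodeType("large", 0.40, 3.0)]
result = cheapest_cluster(120.0, NODE_TYPES, range(1, 41), deadline_hours=4.0)
print(result)
```

Under these made-up numbers the search prefers the cheaper node type, because the "large" node costs four times as much but is assumed to be only three times as fast. A realistic answer would depend on measured job profiles rather than a linear-scaling assumption.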
Index Terms
- No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics