ABSTRACT
We present CoScan, a scheduling framework that eliminates redundant processing in workflows that scan large batches of data in a map-reduce computing environment. CoScan merges Pig programs from multiple users at runtime to reduce I/O contention while adhering to soft deadline requirements. It also supports join workflows that operate on multiple data sources. Our solution maps well to workflows at many Internet companies that reuse data from a common set of inputs. Experiments on the PigMix data analytics benchmark show orders-of-magnitude reductions in resource contention with minimal impact on latency.
Index Terms
- CoScan: cooperative scan sharing in the cloud