ABSTRACT
MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside data blocks of Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.
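The replica-selection idea described above can be illustrated with a minimal sketch. It is not the scheduling logic of Trojan HDFS. It assumes a simple cost model in which a job must read every attribute group that contains at least one attribute it references, and it picks the replica that reads the fewest bytes per row. The names `bytes_read`, `pick_replica`, `widths`, and `replicas`, as well as the attribute names and widths, are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Set

# One replica's layout: a partition of the attributes into column groups.
Layout = List[FrozenSet[str]]


def bytes_read(layout: Layout, needed: Set[str], widths: Dict[str, int]) -> int:
    """Bytes read per row: every group holding a needed attribute is read in full."""
    return sum(
        sum(widths[attr] for attr in group)
        for group in layout
        if group & needed  # group overlaps the attributes the job references
    )


def pick_replica(replicas: Dict[str, Layout], needed: Set[str],
                 widths: Dict[str, int]) -> str:
    """Return the id of the replica whose layout is cheapest for this job."""
    return min(replicas, key=lambda rid: bytes_read(replicas[rid], needed, widths))


# Hypothetical attribute widths in bytes (assumed values).
widths = {"a": 4, "b": 8, "c": 4, "d": 16}

# Three replicas of the same block, each with a different grouping (assumed).
replicas = {
    "r1": [frozenset("abcd")],                                # row-like: one group
    "r2": [frozenset("ab"), frozenset("cd")],                 # two groups
    "r3": [frozenset("ac"), frozenset("b"), frozenset("d")],  # three groups
}

job_ac = pick_replica(replicas, {"a", "c"}, widths)
job_ab = pick_replica(replicas, {"a", "b"}, widths)
print(job_ac, job_ab)
```

Under these assumed widths, a job touching only `a` and `c` is cheapest on `r3`, while a job touching `a` and `b` is cheapest on `r2`, which is the sense in which different layouts per replica can serve different parts of a workload.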
Index Terms
- Trojan data layouts: right shoes for a running elephant