Trojan data layouts: right shoes for a running elephant

Authors:
Alekh Jindal, Saarland University
Jorge-Arnulfo Quiané-Ruiz, Saarland University
Jens Dittrich, Saarland University

SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing, October 2011, Article No.: 21, Pages 1–14
https://doi.org/10.1145/2038916.2038937

Published: 26 October 2011

ABSTRACT

MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside data blocks of the Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.
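
The idea of per-replica attribute groups and layout-aware scheduling can be illustrated with a small sketch. This is not the paper's algorithm or the Trojan HDFS implementation. It assumes a deliberately simple cost model in which a job must read every attribute group that contains at least one attribute it needs, and it ignores tuple reconstruction and seek costs. The schema, the attribute widths, the replica names, and the functions `bytes_read_per_row` and `pick_replica` are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

# A layout partitions the attributes of a block into column groups.
Layout = List[FrozenSet[str]]


def bytes_read_per_row(layout: Layout, needed: Set[str], width: Dict[str, int]) -> int:
    """Bytes scanned per row under the assumed cost model:
    every group that contains a needed attribute is read in full."""
    return sum(
        sum(width[attr] for attr in group)
        for group in layout
        if group & needed
    )


def pick_replica(replicas: Dict[str, Layout], needed: Set[str], width: Dict[str, int]) -> str:
    """Return the id of the replica whose layout scans the fewest bytes for a job."""
    return min(replicas, key=lambda r: bytes_read_per_row(replicas[r], needed, width))


# Hypothetical schema: attribute name -> width in bytes.
WIDTH = {"url": 100, "rank": 4, "duration": 4, "revenue": 8}

# Three replicas of the same data block, each grouped differently.
REPLICAS = {
    "replica-1": [frozenset(WIDTH)],  # a single group, i.e. row-like
    "replica-2": [frozenset({"url"}), frozenset({"rank", "duration", "revenue"})],
    "replica-3": [frozenset({"url", "revenue"}), frozenset({"rank", "duration"})],
}

if __name__ == "__main__":
    for job in ({"url"}, {"rank", "duration"}):
        print(sorted(job), "->", pick_replica(REPLICAS, job, WIDTH))
```

Under these assumptions, a job that reads only `url` is routed to `replica-2` and a job that reads `rank` and `duration` is routed to `replica-3`, while the row-like `replica-1` always scans the full 116 bytes per row. Because every replica still holds all attributes, any of them can serve any job, which is consistent with the abstract's point that fault tolerance is preserved.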

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines

Published in
SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing
October 2011
377 pages
ISBN:9781450309769
DOI:10.1145/2038916
Program Chairs: Jeffrey S. Chase (Duke University), Amr El Abbadi (University of California, Santa Barbara)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
MapReduce
column grouping
per-replica data layout
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 169 of 722 submissions, 23%
Article Metrics
- Total Citations: 64
- Total Downloads: 684
- Downloads (Last 12 months): 8
- Downloads (Last 6 weeks): 2