DOI: 10.1145/2038916.2038937
research-article

Trojan data layouts: right shoes for a running elephant

Published: 26 October 2011

ABSTRACT

MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside data blocks of Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.
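The replica-selection idea in the abstract can be illustrated with a small sketch. It assumes a deliberately simple cost function (a job reads, in full, every attribute group that contains at least one attribute it needs); this is an illustrative assumption, not the cost model or scheduler of Trojan HDFS. The attribute names, byte widths, replica names, and the functions `bytes_read` and `best_replica` are all hypothetical.

```python
from typing import Dict, FrozenSet, List

# A layout is a list of attribute groups; each group is stored contiguously
# inside the data block (hypothetical representation for this sketch).
Layout = List[FrozenSet[str]]


def bytes_read(layout: Layout, needed: FrozenSet[str], width: Dict[str, int]) -> int:
    """Bytes per row a job reads under the assumed cost function:
    every group holding at least one needed attribute is read in full."""
    return sum(
        sum(width[attr] for attr in group)
        for group in layout
        if group & needed
    )


def best_replica(
    replicas: Dict[str, Layout], needed: FrozenSet[str], width: Dict[str, int]
) -> str:
    """Pick the replica whose layout minimizes bytes read for this job."""
    return min(replicas, key=lambda name: bytes_read(replicas[name], needed, width))


# Hypothetical attribute widths in bytes.
width = {"a": 4, "b": 8, "c": 100, "d": 4}

# Three replicas of the same block, each with a different grouping.
replicas = {
    "replica_1": [frozenset({"a", "b"}), frozenset({"c", "d"})],
    "replica_2": [frozenset({"a", "c"}), frozenset({"b", "d"})],
    "replica_3": [frozenset({"a", "b", "c", "d"})],  # row-like: one group
}

job_1 = frozenset({"a", "b"})
job_2 = frozenset({"b", "d"})

print(best_replica(replicas, job_1, width))  # replica_1
print(best_replica(replicas, job_2, width))  # replica_2
```

In this toy setting, the job that touches only `a` and `b` reads 12 bytes per row from `replica_1` versus 116 bytes from the row-like `replica_3`, which is the kind of gap that per-replica layouts are meant to exploit.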


      Published in

      SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing
      October 2011
      377 pages
      ISBN: 9781450309769
      DOI: 10.1145/2038916

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 169 of 722 submissions, 23%
