
Cache-oblivious High-performance Similarity Join

Authors:
Martin Perdacher, University of Vienna, Vienna, Austria
Claudia Plant, University of Vienna, Vienna, Austria
Christian Böhm, Ludwig-Maximilians-Universität, Munich, Germany

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data, June 2019, Pages 87–104, https://doi.org/10.1145/3299869.3319859

Published: 25 June 2019

ABSTRACT

A similarity join combines vectors based on a distance condition. Typically, such algorithms apply a filter step (by indexing or sorting) and then refine pairs of candidate vectors. In this paper, we propose to refine the pairs in an order defined by a space-filling curve, which dramatically improves data locality. Modern multi-core microprocessors are supported by a deep memory hierarchy including RAM, various levels of cache, and registers. The space-filling curve makes our proposed algorithm cache-oblivious, so that it fully exploits the memory hierarchy and can reach the possible peak performance of a multi-core processor. Our novel space-filling curve, called Fast General Form (FGF) Hilbert, overcomes a number of limitations of well-known approaches: it is non-recursive, it is not restricted to traversing squares, and it has constant time and space complexity. Since we demonstrate the easy transformation of conventional loops into cache-oblivious loops, we believe that many algorithms for complex joins and other database operators could be transformed systematically into cache-oblivious SIMD and MIMD parallel algorithms.
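
The idea of refining candidate pairs along a space-filling curve can be illustrated with a minimal sketch. It does not implement the FGF Hilbert curve described in the abstract. It uses the classic iterative Hilbert index-to-coordinate mapping, which is limited to square grids whose side is a power of two and costs O(log n) per step. The function names `hilbert_d2xy`, `join_nested` and `join_hilbert` are hypothetical, and the filter step is omitted, so every pair is treated as a candidate. Pure Python will not show the cache effect; the sketch only shows that the pair loop can be reordered without changing the join result.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def hilbert_d2xy(order, d):
    """Map position d on a classic Hilbert curve to a cell (x, y)
    of a 2**order x 2**order grid (standard iterative formulation)."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate or flip the current quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def join_nested(points, eps):
    """Conventional nested-loop refinement: row by row over all (i, j)."""
    n = len(points)
    return {(i, j)
            for i in range(n)
            for j in range(n)
            if i < j and dist(points[i], points[j]) <= eps}


def join_hilbert(points, eps):
    """Same refinement, but (i, j) is visited in Hilbert order, so that
    consecutive iterations touch neighbouring indices in both operands."""
    n = len(points)
    # Assumption of this sketch: pad the index grid to a power-of-two
    # square and skip cells outside the n x n range.
    order = max(1, (n - 1).bit_length())
    result = set()
    for d in range(4 ** order):
        i, j = hilbert_d2xy(order, d)
        if i < j < n and dist(points[i], points[j]) <= eps:
            result.add((i, j))
    return result


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.2, 3.1), (10.0, 10.0)]
    print(sorted(join_hilbert(pts, 1.0)))
```

Both functions return the same set of index pairs; only the visiting order differs. Padding to a power-of-two square wastes iterations on cells outside the data range, which is one of the restrictions a curve that is not limited to squares avoids.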

Index Terms

Cache-oblivious High-performance Similarity Join
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Shared memory algorithms
2. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        1. Join algorithms
  2. Information systems applications
    1. Data mining
      1. Clustering
      2. Nearest-neighbor search

Published in
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
June 2019
2106 pages
ISBN: 9781450356435
DOI: 10.1145/3299869
General Chairs: Peter Boncz (CWI & Vrije Universiteit Amsterdam, The Netherlands), Stefan Manegold (CWI & Universiteit Leiden, The Netherlands)
Program Chairs: Anastasia Ailamaki (EPFL, Switzerland), Amol Deshpande (University of Maryland, USA), Tim Kraska (MIT, USA)
Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Results Reproduced / v1.1
Author Tags
cache-oblivious
epsilon grid order
hilbert-curve
similarity join
space-filling curve
Qualifiers
- research-article
Acceptance Rates
SIGMOD '19 Paper Acceptance Rate: 88 of 430 submissions, 20%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%