ABSTRACT
A similarity join combines vectors based on a distance condition. Typically, such algorithms apply a filter step (by indexing or sorting) and then refine pairs of candidate vectors. In this paper, we propose to refine the pairs in an order defined by a space-filling curve, which dramatically improves data locality. Modern multi-core microprocessors are supported by a deep memory hierarchy including RAM, various levels of cache, and registers. The space-filling curve makes our proposed algorithm cache-oblivious, so that it fully exploits the memory hierarchy and can reach the peak performance attainable on a multi-core processor. Our novel space-filling curve, called Fast General Form (FGF) Hilbert, overcomes several limitations of well-known approaches: it is non-recursive, it is not restricted to traversing squares, and it has constant time and space complexity. Because we demonstrate how easily conventional loops can be transformed into cache-oblivious loops, we believe that many algorithms for complex joins and other database operators could be transformed systematically into cache-oblivious SIMD and MIMD parallel algorithms.
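To make the idea of refining pairs along a space-filling curve concrete, the following is a minimal sketch, not the paper's FGF Hilbert algorithm. It assumes the classic Hilbert curve on a power-of-two square grid, Euclidean distance, and no filter step (every pair is checked); the names `hilbert_d2xy` and `similarity_join` are hypothetical and chosen only for illustration.

```python
import math


def hilbert_d2xy(order, d):
    """Map position d on a classic Hilbert curve to (x, y) on a 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the quadrant so the sub-curves connect.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def similarity_join(points_r, points_s, eps):
    """Return index pairs (i, j) with dist(points_r[i], points_s[j]) <= eps.

    The pair space (i, j) is visited in Hilbert order rather than row by row,
    so consecutive iterations touch neighbouring vectors of both inputs.
    """
    n = max(len(points_r), len(points_s))
    order = max(1, (n - 1).bit_length())  # smallest square grid covering all pairs
    result = []
    for d in range(4 ** order):
        i, j = hilbert_d2xy(order, d)
        # The classic curve only covers squares, so cells outside the
        # len(points_r) x len(points_s) rectangle are skipped here.
        if i < len(points_r) and j < len(points_s):
            if math.dist(points_r[i], points_s[j]) <= eps:
                result.append((i, j))
    return result
```

The sketch returns the same pairs as a conventional nested loop, only in a different order. Skipping cells outside the rectangle of valid pairs is the kind of overhead that a curve not restricted to squares would avoid.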
Index Terms
- Cache-oblivious High-performance Similarity Join