DOI: 10.1145/3299869.3319859
research-article
Results Reproduced / v1.1

Cache-oblivious High-performance Similarity Join

Published: 25 June 2019

ABSTRACT

A similarity join combines vectors based on a distance condition. Typically, such algorithms apply a filter step (by indexing or sorting) and then refine pairs of candidate vectors. In this paper, we propose to refine the pairs in an order defined by a space-filling curve, which dramatically improves data locality. Modern multi-core microprocessors are supported by a deep memory hierarchy including RAM, various levels of cache, and registers. The space-filling curve makes our proposed algorithm cache-oblivious, allowing it to fully exploit the memory hierarchy and to reach the peak performance attainable on a multi-core processor. Our novel space-filling curve, called Fast General Form (FGF) Hilbert, overcomes a number of limitations of well-known approaches: it is non-recursive, it is not restricted to traversing squares, and it has constant time and space complexity. Because we demonstrate how easily conventional loops can be transformed into cache-oblivious loops, we believe that many algorithms for complex joins and other database operators could be transformed systematically into cache-oblivious SIMD and MIMD parallel algorithms.
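The loop transformation described in the abstract can be illustrated with a minimal sketch. It is not the authors' FGF Hilbert curve: it uses the classic textbook Hilbert index-to-coordinate mapping, which is restricted to square grids whose side is a power of two and takes time proportional to the curve order per step (limitations the abstract says FGF Hilbert avoids). The function names `hilbert_d2xy`, `join_nested`, and `join_hilbert` are hypothetical, and the refinement step is assumed to be a plain Euclidean distance test against a threshold `eps`.

```python
def hilbert_d2xy(order, d):
    """Map position d on a classic Hilbert curve to (x, y) on a
    2**order by 2**order grid (textbook algorithm, not FGF Hilbert)."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so consecutive cells stay adjacent.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def within_eps(p, q, eps):
    """Refinement test (assumed): squared Euclidean distance <= eps**2."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) <= eps * eps


def join_nested(points, eps):
    """Conventional nested loop over all pairs (i, j) with i < j."""
    n = len(points)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if within_eps(points[i], points[j], eps)]


def join_hilbert(points, eps):
    """Same pairs, but visited in Hilbert order over the (i, j) grid, so
    consecutive iterations touch neighbouring indices in both operands."""
    n = len(points)
    order = (n - 1).bit_length()      # smallest order with 2**order >= n
    result = []
    for d in range(4 ** order):
        i, j = hilbert_d2xy(order, d)
        if i < j < n and within_eps(points[i], points[j], eps):
            result.append((i, j))
    return result


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.2, 3.1), (10.0, 10.0)]
    print(sorted(join_hilbert(pts, 1.0)) == join_nested(pts, 1.0))
```

Both functions return the same set of pairs; only the visiting order differs. The sketch simply skips grid cells outside the upper triangle or beyond `n`, which wastes iterations on a square power-of-two grid, and it omits the filter step as well as any SIMD or multi-core parallelism.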

Published in

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
June 2019
2106 pages
ISBN: 9781450356435
DOI: 10.1145/3299869

Copyright © 2019 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SIGMOD '19 Paper Acceptance Rate: 88 of 430 submissions, 20%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
