A pivot-based filtering algorithm for enhancing query performance of LSH.

In this paper, we analyze a phenomenon we call "Non-Uniform" that degrades the query performance of LSH and propose a pivot-based algorithm to improve it. ... We also provide a method to obtain the optimal pivot for even larger improvement. Experiments show that our algorithm significantly improves the query performance of LSH. ... However, when using LSH for queries, a final filtering process based on an exact similarity measure is needed. ...

doi:10.1109/vcip.2011.6115941 dblp:conf/vcip/ZhangGZZL11 fatcat:f2k5tktnjnbzhg6iwjkcvtp2ua

We present a filter-and-refine method to speed up nearest neighbor searches with the Kullback-Leibler divergence for multivariate Gaussians. ... Overall, the method accelerates the search for similar music pieces by a factor of 10-30 and yields high recall values of 95-99% compared to a standard linear search. ... For very high-dimensional data, Locality Sensitive Hashing (LSH, [1]) should be used, as the aforementioned algorithms are likely to perform no better than a linear scan with very high-dimensional ...

doi:10.1007/s11042-010-0679-8 fatcat:bjqi4ujejbgtlm4b4uzhaxz4fa

In this paper, we propose a generic concept that uses both lower and upper bound properties based on Metric Spaces Theory to avoid more element comparisons. ... We analyze the increase in pruning power and show an example of its application to classical nested-loop join algorithms. ... Acknowledgments: The authors would like to thank FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) for the financial support. ...

doi:10.3390/info9050124 fatcat:ix6dyzasebenridcib3jx2ifqq
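
The snippet above mentions using both lower and upper distance bounds from metric-space theory to skip element comparisons. The following is a minimal sketch of that general idea for a range query, assuming a single pivot, Euclidean distance, and the triangle inequality; it is not the paper's algorithm, and the names `range_query`, `pivot`, and `radius` are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Exact distance; any metric obeying the triangle inequality would do."""
    return math.dist(a, b)


def range_query(query, points, pivot, radius, dist=euclidean):
    """Return (matches, computed): indices of points within `radius` of
    `query`, and how many exact query-to-point distances were computed."""
    # In a real index these pivot distances would be precomputed offline;
    # they are computed here only to keep the sketch self-contained.
    pivot_dists = [dist(p, pivot) for p in points]
    d_qp = dist(query, pivot)

    matches, computed = [], 0
    for i, d_op in enumerate(pivot_dists):
        lower = abs(d_qp - d_op)  # triangle inequality: d(q, o) >= |d(q, p) - d(o, p)|
        upper = d_qp + d_op       # triangle inequality: d(q, o) <= d(q, p) + d(o, p)
        if lower > radius:
            continue              # cannot be a match: skip the exact distance
        if upper <= radius:
            matches.append(i)     # must be a match: skip the exact distance
            continue
        computed += 1             # bounds are inconclusive: compute exactly
        if dist(query, points[i]) <= radius:
            matches.append(i)
    return matches, computed


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (1.0, 1.0), (0.9, 0.0)]
    print(range_query((0.2, 0.0), pts, pivot=(0.0, 0.0), radius=1.0))
```

The lower bound discards objects that cannot qualify and the upper bound accepts objects that must qualify, so the exact distance is computed only when neither bound settles the question.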

While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss ... We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. ... We also thank Di Wang for helping with a Lucene baseline; Chris Dyer for a discussion of IBM Model 1 efficiency; Yoav Goldberg, Manaal Faruqui, Chenyan Xiong, Ruey-Cheng Chen for discussions related to ...

doi:10.1145/2983323.2983815 dblp:conf/cikm/BoytsovNMN16 fatcat:7u24e5nm6fev3ni2rppgxfl4cm

Similarity queries are the family of queries based on some similarity metric. ... Then we talk about recent approaches to designing the indexes and operators for highly efficient similarity query processing on top of embeddings (or more generally, high-dimensional data). ... In short, any query looking for answers based on similarity between records instead of exact value match is a similarity query. ...

arXiv:2204.07922v1 fatcat:u5osyghs6vgppnj5gpnrzhae5y

Results indicate that our approach is significantly better than the state-of-the-art algorithms (up to 49% enhancement on existing users and 115% enhancement on new users). ... We propose to use a rich feature set to represent users, according to their web browsing history and search queries. ... In user collaborative filtering such as [3], the algorithm computes the similarity between users based on items they liked. ...

doi:10.1145/2736277.2741667 dblp:conf/www/ElkahkySH15 fatcat:dbvcoir2qngppc2kqdtxk4kuqi

knowledge and of Filtering for high similarity thresholds. ... This includes large volumes of semi-structured data, which pose challenges not only to the scalability of efficiency techniques, but also to their core assumptions: the requirement of Blocking for schema ... MinHash LSH is combined with SN in [92]: when searching for the nearest neighbors of a query entity, the entities in large LSH blocks are sorted via a custom scoring function and, then, a window of fixed ...

arXiv:1905.06167v4 fatcat:zoodv75tazg23cfnq4dwfgt6ge

This plethora of content created the need for finding the desired media in the social media universe. ... Moreover, the diversity of the available content inspired users to demand and formulate more complicated queries. ... Finally, in Fig. 4, we evaluate the retrieval accuracy of LSH by performing 1000 top-100 queries (denoted by 100-NN queries) and varying the dimensionality of the SIFT datasets. ...

doi:10.1007/978-1-4471-4555-4_3 dblp:series/ccn/SemertzidisRTSD13 fatcat:6l3whv5qcjgshmjwcqh6dgelb4

In this survey, we review a large number of relevant works under two different but related frameworks: Blocking and Filtering. ... For each framework we provide a comprehensive list of the relevant works, discussing them in the greater context. We conclude with the most promising directions for future work in the field. ... MinHash LSH is combined with SN in [88]: when searching for the nearest neighbors of a query entity, the entities in large LSH blocks are sorted via a custom scoring function, and then a window of fixed ...

doi:10.1145/3377455 fatcat:uuzuuxwwzrfg7cwfwzswdqvklm

To address this limitation, we propose an efficient data collection method, Query of CC, based on large language models. ... Large language models have demonstrated remarkable potential in various tasks; however, there remains a significant scarcity of open-source models and data for specific domains. ... Exploring the potential impact of retriever selection on the quality of collected data might be a pivotal direction for future research. ...

arXiv:2401.14624v3 fatcat:rivf3fuewbfr3hzwbk34jmoh2m

A taxonomy of indexing techniques is proposed to enable researchers to understand and select the techniques that will serve as a basis for designing a new indexing scheme. ... The purpose of this paper is to examine and review existing indexing techniques for large-scale data. ... Thus, several challenging areas of research can serve as a basis for possible future research directions for the indexing of large IoT data. ...

doi:10.3390/fi14010019 fatcat:xnlzg7cs2fb3lgng65ha5ucf5m

To efficiently find joinable tables with similarity, we propose a block-and-verify method that utilizes pivot-based filtering. ... In this paper, we propose PEXESO, a framework for joinable table discovery in data lakes. ... Takuma Nozawa (NEC Corporation) for discussions. ...

arXiv:2010.13273v4 fatcat:jg5g4jrqnfhpjelexhzwlk4ecy

This can be achieved by first learning, offline, bitwise weights of the hash codes for a diverse set of predefined semantic concept classes. ... This paper introduces an approach that enables query-adaptive ranking of the returned images with equal Hamming distances to the queries. ... Thereafter, a progressive algorithm with an adaptive filter technique was proposed for efficient skyline computation in this environment and summarizes the key principles of the algorithm into a query routing ...

doi:10.17148/ijarcce.2017.6342 fatcat:7f36fh4drfe5tcoar6pj4dmvbu

Similarity search (nearest neighbor search) is the problem of finding, in a large database, the data items whose distances to a query item are the smallest. ... Various methods have been developed to address this problem, and recently a lot of effort has been devoted to approximate search. ... For the subset, an LSH scheme is conducted. The query process first locates a bucket from outer hash tables for a query. If the bucket is empty, the algorithm stops. ...

arXiv:1408.2927v1 fatcat:reknwesjnbafvcbouyudrzp4rq

Specifically, to meet the efficiency requirement of the initial stage, we introduce a neural index for the neural representations of documents, and propose two hybrid search schemes based on both neural ... Vocabulary mismatch is a central problem in information retrieval (IR), i.e., the relevant documents may not contain the same (symbolic) terms as the query. ... Given a query, the documents sharing a prespecified number of k-NPs with the query are filtered to compute the real distance. ...

arXiv:1806.10869v2 fatcat:f7ggl2nnszchzdhqmkupfc63y4

A pivot-based filtering algorithm for enhancing query performance of LSH

A fast audio similarity retrieval method for millions of music tracks

Double Distance-Calculation-Pruning for Similarity Search

Off the Beaten Path

A Survey on Efficient Processing of Similarity Queries over Neural Embeddings [article]

A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems

A Survey of Blocking and Filtering Techniques for Entity Resolution [article]

Multimedia Indexing, Search, and Retrieval in Large Databases of Social Networks [chapter]

Blocking and Filtering Techniques for Entity Resolution

Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [article]

A Survey on Big IoT Data Indexing: Potential Solutions, Recent Advancements, and Open Issues

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach [article]

Content-Based Image Retrieval by Query Adaptive Search using Hash Codes

Hashing for Similarity Search: A Survey [article]

Beyond Precision: A Study on Recall of Initial Retrieval with Neural Representations [article]
