A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

Wang, Pinghui; Qi, Yiyan; Zhang, Yuanming; Zhai, Qiaozhu; Wang, Chenxu; Lui, John C. S.; Guan, Xiaohong

doi:10.1145/3292500.3330825

Abstract: Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating the Jaccard similarity of sets and has been used successfully in many applications, such as similarity search and large-scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). MinHash can be applied to static sets as well as to streaming sets, whose elements arrive in a streaming fashion and whose cardinality is unknown or even infinite; unfortunately, b-bit MinHash and Odd Sketch fail to deal with streaming data. To solve this problem, we design a memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller registers (each register consists of fewer than 7 bits) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and determine the smallest necessary memory usage (i.e., the number of registers used for a MaxLogHash sketch) for the desired accuracy. We conduct experiments on a variety of datasets, and the results show that MaxLogHash is about 5 times more memory efficient than MinHash at the same accuracy and computational cost when estimating high similarities.
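For orientation, the MinHash baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of standard streaming MinHash, not the paper's MaxLogHash method or its estimator. The names `MinHashSketch`, `estimate_jaccard`, and `_hash`, the use of seeded BLAKE2b as the hash family, and the 64-bit register width are all assumptions made for this example.

```python
import hashlib


def _hash(element: str, seed: int) -> int:
    """Hypothetical helper: a seeded 64-bit hash built from BLAKE2b."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class MinHashSketch:
    """Streaming MinHash with k registers, one independent hash per register."""

    def __init__(self, k: int = 128) -> None:
        self.k = k
        # 2**64 is larger than any 64-bit hash value, so it marks "empty".
        self.registers = [2**64] * k

    def update(self, element: str) -> None:
        """Process one stream element; duplicates leave the sketch unchanged."""
        for j in range(self.k):
            h = _hash(element, j)
            if h < self.registers[j]:
                self.registers[j] = h


def estimate_jaccard(a: MinHashSketch, b: MinHashSketch) -> float:
    """Fraction of registers on which the two sketches agree."""
    if a.k != b.k:
        raise ValueError("sketches must use the same number of registers")
    matches = sum(x == y for x, y in zip(a.registers, b.registers))
    return matches / a.k


# Usage: two streams whose underlying sets have Jaccard similarity 0.9
# (900 shared elements out of 1000 distinct elements overall).
sketch_a, sketch_b = MinHashSketch(k=256), MinHashSketch(k=256)
for i in range(0, 950):
    sketch_a.update(str(i))
for i in range(50, 1000):
    sketch_b.update(str(i))
print(estimate_jaccard(sketch_a, sketch_b))
```

Each register in this sketch stores a full 64-bit hash value. That per-register cost is what the compact registers described in the abstract are meant to reduce.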

Subjects:	Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:1905.08977 [cs.DS]
	(or arXiv:1905.08977v1 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.1905.08977
Related DOI:	https://doi.org/10.1145/3292500.3330825
