Scalable Textual Similarity Search on Large Document Collections Through Random Indexing and K-means Clustering.

This paper presents an interactive multimedia search engine, which is capable of searching into multimedia collections by fusing textual and visual information. ... Apart from multimedia search, the engine is able to perform text search and image retrieval independently using both high-level and low-level information. ... ACKNOWLEDGMENTS This work was partially supported by the European Commission by the projects MULTISENSOR (FP7-610411), HOMER (FP7-312883) and KRISTINA (H2020-645012). ...

doi:10.1109/cbmi.2016.7500276 dblp:conf/cbmi/MoumtzidouGMLVK16 fatcat:zx3tgnl6dbcljaaezmakgmc72y

summaries for large textual data collection. ... advanced information indexing, searching, and classification techniques. ...

doi:10.1080/07421222.1999.11518256 fatcat:o7qr3xblcrfqtoxv6quhxs3lzq

In such approaches, clustering is performed on a dedicated node, and they are not suitable for deployment in large distributed networks. ... This centralized approach requires high processing and retrieval time during searching due to scalability of users. ... As a result, a distributed version of K-Means is used. The K-Means algorithm can be summarized as: (1) Select k random starting points as initial centroids for the k clusters. ...

doi:10.9790/0661-16582430 fatcat:wp5vigciebhwfi7egftsawxuq4

Yet, these PDFs remain largely unutilized and understudied in part due to the challenges surrounding the development of scalable pipelines for searching and analyzing them. ... In addition to demonstrating the utility of PDF metadata, this paper offers computationally-efficient machine learning approaches to search and discovery that utilize the PDFs' textual and visual features ... Even webpage indexing relies primarily on textual content, and this flat representation restricts our view on what searching the web and born-digital content could be. ...

arXiv:2112.02471v1 fatcat:yg2xrmgnwva2lpoc334ptiwpoa

The simulations performed on a real dataset, i.e. the Google contest collection, show that our approach makes it possible to obtain an IF index which is, depending on the d-gap encoding chosen, up to 23% smaller ... Granting efficient access to the index is a key issue for the performance of Web Search Engines (WSE). ... In [12] the authors investigate the performance of different index compression schemes through experiments on large query sets and collections of Web Documents. ...

doi:10.1145/967900.968024 dblp:conf/sac/SilvestriPO04 fatcat:xyemltmfynaobay5xlpgzyllwe

This paper focuses on peer-to-peer information retrieval (P2PIR), which aims to retrieve textual documents based on their contents and ranks them based on some relevance measures against the query. ... The "open nature" of P2P systems and their lack of centralized control pose difficult challenges to the search capability and performance of P2PIR systems. ... Within the scope of this paper, P2PIR deals with textual documents and retrieval is based on some ranking measures computed between the query and the document texts. ...

doi:10.1145/1146847.1146896 dblp:conf/infoscale/LeeZL06 fatcat:wajwt4gkj5aqhgtmgipaumhw5m

The K-Means method is used for retrieving the highest-ranked top-k objects near the current location of the user, and time-aware query suggestion brings out the most relevant documents based on the temporal ... the location of the user and the documents retrieved. ... Yuan Hung et al. [14] proposed Top-K search results for query clustering based on the similarity of the ranked URL results returned by the search engine and the query is used for the clustering ...

doi:10.17148/ijarcce.2017.63157 fatcat:v7a3huyozzfh7hizniut5wuhiy

In this paper we present a method for organizing and indexing logo digital libraries like the ones of the patent and trademark offices. ... These descriptors are then indexed by a locality-sensitive hashing data structure aiming to perform approximate k-NN search in high dimensional spaces in sub-linear time. ... ACKNOWLEDGMENTS This work has been partially supported by the Spanish projects TIN2006-15694-C02-02, TIN2009-14633-C03-03 and CONSOLIDER-INGENIO 2010 (CSD2007-00018). We would also thank Dr. R. ...

doi:10.1145/1815330.1815358 dblp:conf/das/RusinolL10 fatcat:6spwkxswfzgf7cvvb6uakfiola

This search technique first partitions the corpus, based on documents' similarity, into topic-based shards. ... This article investigates and extends an alternative: selective search, an approach that partitions the dataset based on document similarity to obtain topic-based shards, and searches only a few shards ... Sample-based K-means (SB K-means). We employ a variant of the time-tested K-means clustering algorithm [Lloyd 2006] to partition the documents based on their similarity. ...

doi:10.1145/2738035 fatcat:fistpgm5abemdeecnpiqmt4szi

retrieval performance on the training data. 2) To be scalable, millions of images together with rich textual information have been crawled from the Web to learn the similarity measure, and the learning ... framework particularly considers the indexing problem to ensure the retrieval efficiency. 3) To alleviate the noises in the unbalanced labels of images and fully utilize the textual information, a Latent ... This image collection is indexed based on K-means-based indexing method [8] using visual features. ...

doi:10.1145/1390334.1390396 dblp:conf/sigir/WangZZ08 fatcat:ihmuowpyk5gz3fq6bpmxayunqm

A common way to achieve this is first quantizing local descriptors into visual words, and then applying scalable textual indexing and retrieval schemes. ... In recent years, large-scale image retrieval shows significant potential in both industry applications and research problems. ... We then sequentially discuss the key point detection, local description, vocabulary generation, vector quantization, indexing and search. ...

arXiv:1304.5168v1 fatcat:6jycm42bg5guzgzlzdbc2pydby

Due to the reliance on the textual information associated with an image, image search engines on the Web lack the discriminative power to deliver visually diverse search results. ... Based on a performance evaluation we find that the outcome of the methods closely resembles human perception of diversity, which was established in an extensive clustering experiment carried out by human ... In [19], we have presented a method for detecting and resolving the ambiguity of a query based on the textual features of the image collection. ...

doi:10.1145/1526709.1526756 dblp:conf/www/LeukenPOZ09 fatcat:h2b62otiunenvcpkfiypv2dmne

We propose an interactive learning approach that builds on and extends the state of the art in user relevance feedback systems and high-dimensional indexing for multimedia. ... We report on a detailed experimental study using the ImageNet and YFCC100M collections, containing 14 million and 100 million images respectively. ... This work was supported by a PhD grant from the IT University of Copenhagen and by the European Regional Development Fund (project Robotics for Industry 4.0, CZ.02.1.01/0.0/0.0/15_003/0000470). ...

doi:10.1007/978-3-030-45439-5_33 fatcat:f7ai4wg7evfjnpa7k3pqsedary

Finally, the candidate annotations are re-ranked using Random Walk with Restarts and only the top ones are reserved as the final annotations. ... First, content-based image retrieval technology is used to retrieve a set of visually similar images from a large-scale Web image set. ... To speed up the similarity search process, a K-means-based indexing algorithm is used [13]. ...

doi:10.1007/s00530-008-0128-y fatcat:vtmczcd5qva5tcqcufsbdg7csa

LINGO works in two main steps: cluster label induction using the Latent Semantic Indexing (LSI) technique, and cluster content discovery using the Vector Space Model (VSM). ... From theoretical evidence, using Okapi BM25 as the scoring method in LSI (LSI+Okapi BM25) for cluster content discovery instead of VSM also results in better cluster generation in terms of scalability and ... Due to the large size of data, which has some sort of exponential property, it is necessary to provide a scalable framework which is further capable of indexing and searching web contents, i.e. ...

arXiv:2112.08486v1 fatcat:q4jr2zlfhbh4lcqxdosto6jrxu

A multimedia interactive search engine based on graph-based and non-linear multimodal fusion

Verifying the Proximity and Size Hypothesis for Self-Organizing Maps

Text Clustering in Distributed Networks with Enhanced File Security

Grappling with the Scale of Born-Digital Government Publications: Toward Pipelines for Processing and Searching Millions of PDFs [article]

Assigning document identifiers to enhance compressibility of Web Search Engines indexes

Information retrieval in a peer-to-peer environment

Spatiotemporal Keyword Query Suggestion Based On Document Proximity and K-Means Method – A Review

Efficient logo retrieval through hashing shape context descriptors

Selective Search

Learning to reduce the semantic gap in web image retrieval and annotation

Image Retrieval based on Bag-of-Words model [article]

Visual diversification of image search results

Interactive Learning for Multimedia at Large [chapter]

Scalable search-based image annotation

Text Mining Through Label Induction Grouping Algorithm Based Method [article]
