BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval
[article] · 2022 · arXiv pre-print
On the one hand, we leverage the implicit interaction and composition of cross-modal embeddings from the bottom local characteristics to the top global semantics, preserving and transforming the visual ...
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text, which potentially impacts a wide variety ...
To obtain the composed representation of the multi-modal query, existing approaches mainly resort to cross-modality interaction and fusion operation on the global semantics (e.g., top layer features from ...
arXiv:2207.04211v1 · fatcat:cjondclijfhtzh5eojiwp5evre
Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval
2023 · Proceedings of the 31st ACM International Conference on Multimedia
Specifically, a well-designed distilled intra-modal interaction module is deployed to excavate modality-specific concept knowledge with global-local knowledge distillation under the guidance of implicit ...
To the best of our knowledge, this is the first attempt at multi-granularity transformer-based cross-modal hashing. ...
The experiments contain two cross-modal retrieval tasks, i.e., image-to-text retrieval and text-to-image retrieval, which search relevant texts by querying images and relevant images by querying ...
doi:10.1145/3581783.3612411 · fatcat:cn7hg7tztjfvtev2zjzl3dumgi
Progressive Learning for Image Retrieval with Hybrid-Modality Queries
[article] · 2022 · arXiv pre-print
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format ...
the importance of image and text in the hybrid-modality query for better retrieval. ...
A hybrid-modality query composing module F(·) is applied to the reference image encoder and the text encoder to combine the multi-modal queries for target image retrieval. ...
arXiv:2204.11212v1 · fatcat:n6yqh2x3r5c7hpy7ca2hsp6yra
Progressive Learning for Image Retrieval with Hybrid-Modality Queries
2022 · Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format ...
the importance of image and text in the hybrid-modality query for better retrieval. ...
A hybrid-modality query composing module F(·) is applied to the reference image encoder and the text encoder to combine the multi-modal queries for target image retrieval. ...
doi:10.1145/3477495.3532047 · fatcat:qnqfia5d3ncd3b2g4rombx57ve
Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking
[article] · 2024 · arXiv pre-print
Composed image retrieval attempts to retrieve an image of interest from gallery images through a composed query of a reference image and its corresponding modified text. ...
In this work, we present a new training-free zero-shot composed image retrieval method which translates the query into explicit human-understandable text. ...
An overview of the Training-Free Composed Image Retrieval (TFCIR) model which consists of: Top: Global Retrieval Baseline (GRB) transforms the text-image composed query into a text-only query with a Large ...
arXiv:2312.08924v2 · fatcat:k676qfsubngcvem5vfkvslnkzq
Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation
[article] · 2023 · arXiv pre-print
TIR boosts token-level local alignment via calibrating text-to-image attention with a learnable mask. ...
However, simultaneously establishing global correspondences between the image (e.g., Chest X-ray) and its related report and local alignments between image patches and keywords remains challenging. ...
to boost local cross-modal alignment. ...
arXiv:2303.15932v5 · fatcat:zs5hhjagrbbkxhtab6fu7aohsy
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
[article] · 2020 · arXiv pre-print
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by ...
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions. ...
We follow the same procedure as in VSM to compute query-video matching scores both locally (frame-level, for moment retrieval) and globally (clip-level, for video retrieval). ...
arXiv:2005.00200v2 · fatcat:skm6ktfgq5hpzhdsbmrajkbjcq
CRET
2022 · Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Specifically, the CCM module is composed of transformer decoders and a set of decoder centers. ...
Given a text query, the text-to-video retrieval task aims to find the relevant videos in the database. ...
transformer-based decoders, and finally produces locally-aligned features for both the text and video modalities. ...
doi:10.1145/3477495.3531960 · fatcat:3itdzyx2vvgo7an2icsjz7vyky
Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching
2022 · AAAI Conference on Artificial Intelligence
Image-text matching bridges vision and language, which is a crucial task in the field of multi-modal intelligence. ...
However, although a region-word pair is locally matched across modalities, it may be inconsistent/unreliable from the global perspective of image-text, resulting in inaccurate relevance measurement. ...
dblp:conf/aaai/ZhangMZ022 · fatcat:pu2yllcsxngexlzda2hgij7rxy
TVT: Three-Way Vision Transformer through Multi-Modal Hypersphere Learning for Zero-Shot Sketch-Based Image Retrieval
2022 · AAAI Conference on Artificial Intelligence
In this paper, we study the zero-shot sketch-based image retrieval (ZS-SBIR) task, which retrieves natural images related to sketch queries from unseen categories. ...
Furthermore, our method learns a multi-modal hypersphere by performing inter- and intra-modal alignment without loss of uniformity, which aims to bridge the modality gap between sketch and image ...
Jialin Tian was an intern at Meituan when this work was performed. ...
dblp:conf/aaai/TianXS0S22 · fatcat:hssgw3baarg3nioiipdbm6zq54
Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders
[article] · 2021 · arXiv pre-print
In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. ...
We argue that the fine-grained alignments produced by TERAN pave the way towards the research for effective and efficient methods for large-scale cross-modal information retrieval. ...
The authors in [41, 42] extend the RN to produce compact features for relation-aware image retrieval. However, they do not explore the multi-modal retrieval setup. ...
arXiv:2008.05231v2 · fatcat:h5ybwbeukjamviphhfykrcbpnu
T-EMDE: Sketching-based global similarity for cross-modal retrieval
[article] · 2021 · arXiv pre-print
The key challenge in cross-modal retrieval is to find similarities between objects represented with different modalities, such as image and text. ...
It facilitates communication between modalities, as each global text/image representation is expressed with a standardized sketch histogram which represents the same manifold structures irrespective of ...
[28] proposed Transformer LSH Attention, which uses the Locality-Sensitive Hashing method to obtain hashes for both the keys and the queries. ...
arXiv:2105.04242v1 · fatcat:6e4lzid6evdqdh2kw4sgt3rgna
Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
2022 · Proceedings of the 2022 International Conference on Multimedia Retrieval
Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. ...
For the training, we propose multi-scale matching from both global and local perspectives, and penalize mismatched phrases. ...
We concatenate the sentence and its phrases on the language side, and the image and its regions on the vision side, then present a mask transformer for joint cross-modality modeling with multi-grained semantics. ...
doi:10.1145/3512527.3531368 · fatcat:umzmktgwazcbhjl4e2zbjj6jum
X-TRA: Improving Chest X-ray Tasks with Cross-Modal Retrieval Augmentation
[article] · 2023 · arXiv pre-print
We use this index in our downstream tasks to augment image representations through multi-head attention for disease classification and report retrieval. ...
Our downstream report retrieval even proves competitive with dedicated report generation methods, paving the way for this approach in medical imaging. ...
Related Work Multi-modal alignment The introduction of Transformers for natural language processing (NLP) accelerated the development of integrated vision-language (VL) alignment models suitable for various ...
arXiv:2302.11352v1 · fatcat:mdr4e7647zehrlpoqarneue454
CLIP2TV: Align, Match and Distill for Video-Text Retrieval
[article] · 2022 · arXiv pre-print
With the success on both visual and textual representation learning, transformer based encoders and fusion methods have also been adopted in the field of video-text retrieval. ...
To achieve this, we first revisit some recent works on multi-modal learning, then introduce some techniques into video-text retrieval, and finally evaluate them through extensive experiments in different configurations ...
This justifies the necessity of aligning the two modalities before sending them to the multi-modal transformer. The third row is CLIP2TV combining vta with vtm, retrieving the result from vta. ...
arXiv:2111.05610v2 · fatcat:mweypvpbw5d6lbhd2y47o72zcy