BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval
[article] · 2022 · arXiv pre-print
On the one hand, we leverage the implicit interaction and composition of cross-modal embeddings from the bottom local characteristics to the top global semantics, preserving and transforming the visual ...
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text, which potentially impacts a wide variety ...
To obtain the composed representation of the multi-modal query, existing approaches mainly resort to cross-modality interaction and fusion operation on the global semantics (e.g., top layer features from ...
arXiv:2207.04211v1 · fatcat:cjondclijfhtzh5eojiwp5evre
Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval
2023 · Proceedings of the 31st ACM International Conference on Multimedia
Specifically, a well-designed distilled intra-modal interaction module is deployed to excavate modality-specific concept knowledge with global-local knowledge distillation under the guidance of implicit ...
To the best of our knowledge, this is the first attempt at multi-granularity transformer-based cross-modal hashing. ...
The experiments contain two cross-modal retrieval tasks, i.e., image-to-text retrieval and text-to-image retrieval, which search relevant texts by querying images and relevant images by querying ...
doi:10.1145/3581783.3612411 · fatcat:cn7hg7tztjfvtev2zjzl3dumgi
Progressive Learning for Image Retrieval with Hybrid-Modality Queries
[article] · 2022 · arXiv pre-print
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format ...
the importance of image and text in the hybrid-modality query for better retrieval. ...
A hybrid-modality query composing module F(·) is applied to the reference image encoder and the text encoder to combine the multi-modal queries for target image retrieval. ...
arXiv:2204.11212v1 · fatcat:n6yqh2x3r5c7hpy7ca2hsp6yra
Progressive Learning for Image Retrieval with Hybrid-Modality Queries
2022 · Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format ...
the importance of image and text in the hybrid-modality query for better retrieval. ...
A hybrid-modality query composing module F(·) is applied to the reference image encoder and the text encoder to combine the multi-modal queries for target image retrieval. ...
doi:10.1145/3477495.3532047 · fatcat:qnqfia5d3ncd3b2g4rombx57ve
Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking
[article] · 2024 · arXiv pre-print
Composed image retrieval attempts to retrieve an image of interest from gallery images through a composed query of a reference image and its corresponding modified text. ...
In this work, we present a new training-free zero-shot composed image retrieval method which translates the query into explicit human-understandable text. ...
An overview of the Training-Free Composed Image Retrieval (TFCIR) model which consists of: Top: Global Retrieval Baseline (GRB) transforms the text-image composed query into a text-only query with a Large ...
arXiv:2312.08924v2 · fatcat:k676qfsubngcvem5vfkvslnkzq
Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation
[article] · 2023 · arXiv pre-print
TIR boosts token-level local alignment via calibrating text-to-image attention with a learnable mask. ...
However, simultaneously establishing global correspondences between the image (e.g., Chest X-ray) and its related report and local alignments between image patches and keywords remains challenging. ...
to boost local cross-modal alignment. ...
arXiv:2303.15932v5 · fatcat:zs5hhjagrbbkxhtab6fu7aohsy
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
[article] · 2020 · arXiv pre-print
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by ...
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions. ...
We follow the same procedure as in VSM to compute query-video matching scores both locally (frame-level, for moment retrieval) and globally (clip-level, for video retrieval). ...
arXiv:2005.00200v2 · fatcat:skm6ktfgq5hpzhdsbmrajkbjcq
CRET
2022 · Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Specifically, the CCM module is composed of transformer decoders and a set of decoder centers. ...
Given a text query, the text-to-video retrieval task aims to find the relevant videos in the database. ...
transformer-based decoders, and finally produces locally-aligned features for both the text and video modalities. ...
doi:10.1145/3477495.3531960 · fatcat:3itdzyx2vvgo7an2icsjz7vyky
Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching
2022 · AAAI Conference on Artificial Intelligence
Image-text matching bridges vision and language, which is a crucial task in the field of multi-modal intelligence. ...
However, although a region-word pair is locally matched across modalities, it may be inconsistent/unreliable from the global perspective of image-text, resulting in inaccurate relevance measurement. ...
dblp:conf/aaai/ZhangMZ022 · fatcat:pu2yllcsxngexlzda2hgij7rxy
TVT: Three-Way Vision Transformer through Multi-Modal Hypersphere Learning for Zero-Shot Sketch-Based Image Retrieval
2022 · AAAI Conference on Artificial Intelligence
In this paper, we study the zero-shot sketch-based image retrieval (ZS-SBIR) task, which retrieves natural images related to sketch queries from unseen categories. ...
Furthermore, our method learns a multi-modal hypersphere by performing inter- and intra-modal alignment without loss of uniformity, which aims to bridge the modality gap between sketch and image ...
Jialin Tian was an intern at Meituan when this work was performed. ...
dblp:conf/aaai/TianXS0S22 · fatcat:hssgw3baarg3nioiipdbm6zq54
Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders
[article] · 2021 · arXiv pre-print
In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. ...
We argue that the fine-grained alignments produced by TERAN pave the way towards the research for effective and efficient methods for large-scale cross-modal information retrieval. ...
The authors in [41, 42] extend the RN to produce compact features for relation-aware image retrieval. However, they do not explore the multi-modal retrieval setup. ...
arXiv:2008.05231v2 · fatcat:h5ybwbeukjamviphhfykrcbpnu
T-EMDE: Sketching-based global similarity for cross-modal retrieval
[article] · 2021 · arXiv pre-print
The key challenge in cross-modal retrieval is to find similarities between objects represented with different modalities, such as image and text. ...
It facilitates communication between modalities, as each global text/image representation is expressed with a standardized sketch histogram which represents the same manifold structures irrespective of ...
[28] proposed Transformer LSH Attention, which uses the Locality-Sensitive Hashing method to obtain hashes for both the keys and the queries. ...
arXiv:2105.04242v1 · fatcat:6e4lzid6evdqdh2kw4sgt3rgna
Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
2022 · Proceedings of the 2022 International Conference on Multimedia Retrieval
Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. ...
For the training, we propose multi-scale matching from both global and local perspectives, and penalize mismatched phrases. ...
We concatenate the sentence and its phrases on the language side, and the image and its regions on the vision side, then present a mask transformer for joint cross-modality modeling with multi-grained semantics. ...
doi:10.1145/3512527.3531368 · fatcat:umzmktgwazcbhjl4e2zbjj6jum
X-TRA: Improving Chest X-ray Tasks with Cross-Modal Retrieval Augmentation
[article] · 2023 · arXiv pre-print
We use this index in our downstream tasks to augment image representations through multi-head attention for disease classification and report retrieval. ...
Our downstream report retrieval even proves competitive with dedicated report generation methods, paving the way for this approach in medical imaging. ...
Related Work Multi-modal alignment The introduction of Transformers for natural language processing (NLP) accelerated the development of integrated vision-language (VL) alignment models suitable for various ...
arXiv:2302.11352v1 · fatcat:mdr4e7647zehrlpoqarneue454
CLIP2TV: Align, Match and Distill for Video-Text Retrieval
[article] · 2022 · arXiv pre-print
With the success on both visual and textual representation learning, transformer based encoders and fusion methods have also been adopted in the field of video-text retrieval. ...
To achieve this, we first revisit some recent works on multi-modal learning, then introduce some techniques into video-text retrieval, and finally evaluate them through extensive experiments in different configurations ...
This justifies the necessity of aligning the two modalities before sending them to the multi-modal transformer. The third row is CLIP2TV combining vta with vtm, retrieving the result from vta. ...
arXiv:2111.05610v2 · fatcat:mweypvpbw5d6lbhd2y47o72zcy