4,519 Hits in 7.2 sec

BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [article]

Wenqiao Zhang, Jiannan Guo, Mengze Li, Haochen Shi, Shengyu Zhang, Juncheng Li, Siliang Tang, Yueting Zhuang
2022 arXiv   pre-print
On the one hand, we leverage the implicit interaction and composition of cross-modal embeddings from the bottom local characteristics to the top global semantics, preserving and transforming the visual  ...  Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text, which potentially impacts a wide variety  ...  To obtain the composed representation of the multi-modal query, existing approaches mainly resort to cross-modality interaction and fusion operation on the global semantics (e.g., top layer features from  ... 
arXiv:2207.04211v1 fatcat:cjondclijfhtzh5eojiwp5evre
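The global-semantics fusion that this snippet says existing CIR approaches rely on can be sketched in a few lines. Below is an illustrative late-fusion module over CLIP-style global embeddings, not the BOSS architecture; the embedding dimension and MLP design are assumptions.

```python
import torch
import torch.nn as nn

class GlobalFusion(nn.Module):
    """Late fusion of one global image embedding and one global text
    embedding into a single composed query vector."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_emb, txt_emb):
        fused = self.mlp(torch.cat([img_emb, txt_emb], dim=-1))
        # L2-normalize so retrieval reduces to cosine similarity
        return fused / fused.norm(dim=-1, keepdim=True)

# query = GlobalFusion()(image_emb, text_emb)   # (B, 512) each
# scores = query @ gallery_embs.T               # rank gallery images by score
```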

Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval

Yishu Liu, Qingpeng Wu, Zheng Zhang, Jingyi Zhang, Guangming Lu
2023 Proceedings of the 31st ACM International Conference on Multimedia  
Specifically, a well-designed distilled intra-modal interaction module is deployed to excavate modality-specific concept knowledge with global-local knowledge distillation under the guidance of implicit  ...  To the best of our knowledge, this is the first attempt at multi-granularity transformer-based cross-modal hashing.  ...  The experiments contain two cross-modal retrieval tasks, i.e., image-to-text retrieval and text-to-image retrieval, which search relevant texts by querying images and relevant images by querying  ... 
doi:10.1145/3581783.3612411 fatcat:cn7hg7tztjfvtev2zjzl3dumgi
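For context on what cross-modal hashing buys at retrieval time, here is a minimal sketch (not this paper's method): embeddings from either modality are binarized with a shared projection, and the gallery is ranked by Hamming distance. The random projection stands in for a learned one.

```python
import numpy as np

def to_hash(features, proj):
    """Binarize real-valued embeddings: sign of a shared projection."""
    return (features @ proj > 0).astype(np.uint8)      # (n, n_bits)

def hamming_rank(query_code, gallery_codes):
    """Rank gallery items by Hamming distance to the query code."""
    dists = (query_code[None, :] != gallery_codes).sum(axis=1)
    return np.argsort(dists)

rng = np.random.default_rng(0)
proj = rng.standard_normal((512, 64))           # 512-d features -> 64-bit codes
img_codes = to_hash(rng.standard_normal((1000, 512)), proj)
txt_code = to_hash(rng.standard_normal((1, 512)), proj)[0]
print(hamming_rank(txt_code, img_codes)[:5])    # top-5 images for a text query
```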

Progressive Learning for Image Retrieval with Hybrid-Modality Queries [article]

Yida Zhao, Yuqing Song, Qin Jin
2022 arXiv   pre-print
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format  ...  the importance of image and text in the hybrid-modality query for better retrieval.  ...  A hybrid-modality query composing module F(·) is applied on the reference image encoder and the text encoder to combine the multi-modal queries for the target image retrieval.  ... 
arXiv:2204.11212v1 fatcat:n6yqh2x3r5c7hpy7ca2hsp6yra

Progressive Learning for Image Retrieval with Hybrid-Modality Queries

Yida Zhao, Yuqing Song, Qin Jin
2022 Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval  
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format  ...  the importance of image and text in the hybrid-modality query for better retrieval.  ...  A hybrid-modality query composing module F(·) is applied on the reference image encoder and the text encoder to combine the multi-modal queries for the target image retrieval.  ... 
doi:10.1145/3477495.3532047 fatcat:qnqfia5d3ncd3b2g4rombx57ve
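One plausible reading of a composing module F(·) that weighs the importance of image and text is a learned gate. The sketch below is a hypothetical minimal version, not the paper's module; the gating design is an assumption.

```python
import torch
import torch.nn as nn

class ComposeF(nn.Module):
    """Hypothetical composing module F(.): a learned gate decides how much
    the image vs. the text contributes to the joint query."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_emb, txt_emb):
        g = self.gate(torch.cat([img_emb, txt_emb], dim=-1))  # values in (0, 1)
        return g * img_emb + (1 - g) * txt_emb
```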

Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking [article]

Shitong Sun, Fanghua Ye, Shaogang Gong
2024 arXiv   pre-print
Composed image retrieval attempts to retrieve an image of interest from gallery images through a composed query of a reference image and its corresponding modified text.  ...  In this work, we present a new training-free zero-shot composed image retrieval method which translates the query into explicit human-understandable text.  ...  An overview of the Training-Free Composed Image Retrieval (TFCIR) model which consists of: Top: Global Retrieval Baseline (GRB) transforms the text-image composed query into a text-only query with a Large  ... 
arXiv:2312.08924v2 fatcat:k676qfsubngcvem5vfkvslnkzq
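The baseline described here, which rewrites the composed query as plain text, can be outlined as follows. `caption_model` and `llm` are assumed callables standing in for whatever captioner and language model are used; this is a sketch of the idea, not the authors' pipeline.

```python
def compose_text_query(image, mod_text, caption_model, llm):
    """Rewrite a (reference image, modification text) query as one
    human-readable sentence; caption_model and llm are assumed callables."""
    caption = caption_model(image)               # e.g. "a red pleated dress"
    prompt = (f"Image description: {caption}\n"
              f"Requested change: {mod_text}\n"
              f"Describe the target image in one sentence:")
    return llm(prompt)                           # text-only query

# target_text = compose_text_query(img, "make it blue", captioner, llm)
# scores = text_encoder(target_text) @ gallery_image_embs.T
```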

Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation [article]

Yaowei Li, Bang Yang, Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yuexian Zou
2023 arXiv   pre-print
TIR boosts token-level local alignment via calibrating text-to-image attention with a learnable mask.  ...  However, simultaneously establishing global correspondences between the image (e.g., Chest X-ray) and its related report and local alignments between image patches and keywords remains challenging.  ...  to boost local cross-modal alignment.  ... 
arXiv:2303.15932v5 fatcat:zs5hhjagrbbkxhtab6fu7aohsy
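A rough sketch of what "calibrating text-to-image attention with a learnable mask" could look like: an additive learnable bias on the attention logits. The per-patch bias and dimensions are assumptions, not the paper's TIR module.

```python
import torch
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    """Illustrative text-to-image attention with a learnable additive
    bias ("mask") on the attention logits."""
    def __init__(self, dim=256, n_patches=49):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.mask = nn.Parameter(torch.zeros(n_patches))  # per-patch bias

    def forward(self, tokens, patches):
        # tokens: (B, T, dim) text tokens; patches: (B, P, dim) image patches
        logits = self.q(tokens) @ self.k(patches).transpose(1, 2)
        logits = logits / tokens.size(-1) ** 0.5 + self.mask  # calibrate
        return logits.softmax(dim=-1) @ self.v(patches)
```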

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [article]

Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu
2020 arXiv   pre-print
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by  ...  HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.  ...  We follow the same procedure as in VSM to compute query-video matching scores both locally (frame-level, for moment retrieval) and globally (clip-level, for video retrieval).  ... 
arXiv:2005.00200v2 fatcat:skm6ktfgq5hpzhdsbmrajkbjcq
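The local (frame-level) versus global (clip-level) matching scores mentioned here can be illustrated with cosine similarities. This is a simplification of HERO's VSM scoring, with mean pooling assumed for the global score.

```python
import torch

def matching_scores(query_emb, frame_embs):
    """Frame-level (local) and clip-level (global) query-video scores."""
    q = query_emb / query_emb.norm()
    f = frame_embs / frame_embs.norm(dim=-1, keepdim=True)  # (n_frames, d)
    local = f @ q                  # per-frame scores -> moment retrieval
    global_score = local.mean()    # pooled clip score -> video retrieval
    return local, global_score
```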

CRET

Kaixiang Ji, Jiajia Liu, Weixiang Hong, Liheng Zhong, Jian Wang, Jingdong Chen, Wei Chu
2022 Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval  
Specifically, the CCM module is composed of transformer decoders and a set of decoder centers.  ...  Given a text query, the text-to-video retrieval task aims to find the relevant videos in the database.  ...  transformer-based decoders, and finally produces locally-aligned features for both the text and video modalities.  ... 
doi:10.1145/3477495.3531960 fatcat:3itdzyx2vvgo7an2icsjz7vyky

Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching

Huatian Zhang, Zhendong Mao, Kun Zhang, Yongdong Zhang
2022 AAAI Conference on Artificial Intelligence  
Image-text matching bridges vision and language, which is a crucial task in the field of multi-modal intelligence.  ...  However, although a region-word pair is locally matched across modalities, it may be inconsistent/unreliable from the global perspective of image-text, resulting in inaccurate relevance measurement.  ...  Funds for the Central Universities under Grants WK3480000008 and WK3480000010.  ... 
dblp:conf/aaai/ZhangMZ022 fatcat:pu2yllcsxngexlzda2hgij7rxy

TVT: Three-Way Vision Transformer through Multi-Modal Hypersphere Learning for Zero-Shot Sketch-Based Image Retrieval

Jialin Tian, Xing Xu, Fumin Shen, Yang Yang, Heng Tao Shen
2022 AAAI Conference on Artificial Intelligence  
In this paper, we study the zero-shot sketch-based image retrieval (ZS-SBIR) task, which retrieves natural images related to sketch queries from unseen categories.  ...  Furthermore, our method learns a multi-modal hypersphere by performing inter- and intra-modal alignment without loss of uniformity, which aims to bridge the modality gap between sketch and image  ...  Jialin Tian was an intern at Meituan when this work was performed.  ... 
dblp:conf/aaai/TianXS0S22 fatcat:hssgw3baarg3nioiipdbm6zq54
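The snippet's "alignment without loss of uniformity" echoes the standard alignment and uniformity measures on the hypersphere (Wang & Isola, 2020). A minimal version of those two measures, assuming L2-normalized features, not TVT's full objective:

```python
import torch

def alignment(x, y):
    """Matched cross-modal pairs should be close (x[i] pairs with y[i])."""
    return ((x - y) ** 2).sum(dim=1).mean()

def uniformity(x, t=2):
    """Features should spread uniformly over the unit hypersphere."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```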

Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders [article]

Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, Stéphane Marchand-Maillet
2021 arXiv   pre-print
In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level.  ...  We argue that the fine-grained alignments produced by TERAN pave the way towards effective and efficient methods for large-scale cross-modal information retrieval.  ...  The authors in [41, 42] extend the RN for producing compact features for relation-aware image retrieval. However, they do not explore the multi-modal retrieval setup.  ... 
arXiv:2008.05231v2 fatcat:h5ybwbeukjamviphhfykrcbpnu
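With supervision only at the global image-sentence level, the word-region alignment matrix must be pooled into a single score; a common choice in alignment-based matching models is max over regions, sum over words. A sketch under that assumption (not necessarily TERAN's exact pooling):

```python
import torch

def global_score(regions, words):
    """Image-sentence score pooled from word-region alignments:
    align each word to its best region, then sum over words."""
    r = regions / regions.norm(dim=-1, keepdim=True)  # (R, d)
    w = words / words.norm(dim=-1, keepdim=True)      # (W, d)
    sim = w @ r.T                                     # (W, R) cosine grid
    return sim.max(dim=1).values.sum()
```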

T-EMDE: Sketching-based global similarity for cross-modal retrieval [article]

Barbara Rychalska, Mikolaj Wieczorek, Jacek Dabrowski
2021 arXiv   pre-print
The key challenge in cross-modal retrieval is to find similarities between objects represented with different modalities, such as image and text.  ...  It facilitates communication between modalities, as each global text/image representation is expressed with a standardized sketch histogram which represents the same manifold structures irrespective of  ...  [28] proposed Transformer LSH Attention, which uses the Locality-Sensitive Hashing method to obtain hashes for both the Keys and the Queries.  ... 
arXiv:2105.04242v1 fatcat:6e4lzid6evdqdh2kw4sgt3rgna
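For the LSH attention mentioned in the snippet, the underlying primitive is random-hyperplane hashing: nearby vectors tend to fall into the same bucket, so attention can be restricted to within-bucket pairs. A minimal standalone sketch:

```python
import numpy as np

def lsh_buckets(vectors, n_planes=8, seed=0):
    """Random-hyperplane LSH: vectors on the same side of every hyperplane
    share a bucket, so similar keys/queries tend to collide."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_planes))
    bits = vectors @ planes > 0                  # (n, n_planes) sign bits
    return bits @ (1 << np.arange(n_planes))     # pack bits into bucket ids

keys = np.random.randn(1000, 64)
print(lsh_buckets(keys)[:10])   # bucket id per key; attend within buckets
```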

Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval

Zhihao Fan, Zhongyu Wei, Zejun Li, Siyuan Wang, Haijun Shan, Xuanjing Huang, Jianqing Fan
2022 Proceedings of the 2022 International Conference on Multimedia Retrieval  
Existing research for image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image.  ...  For the training, we propose multi-scale matching from both global and local perspectives, and penalize mismatched phrases.  ...  We concatenate the sentence and its phrases on the language side and the image and its regions on the vision side, then present a mask transformer for joint cross-modality modeling with multi-grained semantics.  ... 
doi:10.1145/3512527.3531368 fatcat:umzmktgwazcbhjl4e2zbjj6jum

X-TRA: Improving Chest X-ray Tasks with Cross-Modal Retrieval Augmentation [article]

Tom van Sonsbeek, Marcel Worring
2023 arXiv   pre-print
We use this index in our downstream tasks to augment image representations through multi-head attention for disease classification and report retrieval.  ...  Our downstream report retrieval even proves competitive with dedicated report generation methods, paving the path for this method in medical imaging.  ...  Related Work Multi-modal alignment The introduction of Transformers for natural language processing (NLP) accelerated the development of integrated vision-language (VL) alignment models suitable for various  ... 
arXiv:2302.11352v1 fatcat:mdr4e7647zehrlpoqarneue454
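The retrieval-augmentation step described here (look up neighbours in an index, fold them in with multi-head attention) can be sketched as below. The top-k cosine lookup stands in for a real index such as FAISS, and the attention configuration is an assumption.

```python
import torch
import torch.nn as nn

def retrieve_and_augment(query, memory, attn, k=5):
    """Find the k nearest stored representations to the query, then fold
    them into the query feature with multi-head attention."""
    q = query / query.norm(dim=-1, keepdim=True)     # (1, d)
    m = memory / memory.norm(dim=-1, keepdim=True)   # (N, d)
    idx = (q @ m.T).topk(k, dim=-1).indices[0]       # nearest-neighbour ids
    neighbours = memory[idx].unsqueeze(0)            # (1, k, d)
    out, _ = attn(query.unsqueeze(0), neighbours, neighbours)
    return out.squeeze(0)                            # augmented feature, (1, d)

d = 64
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
feat = retrieve_and_augment(torch.randn(1, d), torch.randn(100, d), attn)
```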

CLIP2TV: Align, Match and Distill for Video-Text Retrieval [article]

Zijian Gao, Jingyu Liu, Weiqi Sun, Sheng Chen, Dedan Chang, Lili Zhao
2022 arXiv   pre-print
With the success of both visual and textual representation learning, transformer-based encoders and fusion methods have also been adopted in the field of video-text retrieval.  ...  To achieve this, we first revisit some recent works on multi-modal learning, then introduce some techniques into video-text retrieval, and finally evaluate them through extensive experiments in different configurations  ...  This justifies the necessity of aligning the two modalities before sending them to the multi-modal transformer. The 3rd row is CLIP2TV combining vta with vtm, retrieving the result from vta.  ... 
arXiv:2111.05610v2 fatcat:mweypvpbw5d6lbhd2y47o72zcy
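The vta/vtm split referred to in the snippet is, in outline, a contrastive alignment score plus a fusion-transformer matching score. The sketch below is a guess at the general shape, not CLIP2TV's implementation; the fusion layer and head are assumptions.

```python
import torch
import torch.nn as nn

class AlignThenMatch(nn.Module):
    """Coarse alignment (vta) in a shared space, then a fusion transformer
    scores individual pairs (vtm)."""
    def __init__(self, dim=512):
        super().__init__()
        self.fusion = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.match_head = nn.Linear(dim, 1)

    def vta_scores(self, vid_emb, txt_emb):
        v = vid_emb / vid_emb.norm(dim=-1, keepdim=True)
        t = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        return t @ v.T                             # (n_txt, n_vid) cosine grid

    def vtm_score(self, vid_tokens, txt_tokens):
        fused = self.fusion(torch.cat([vid_tokens, txt_tokens], dim=1))
        return self.match_head(fused.mean(dim=1))  # matching logit per pair
```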
Showing results 1 — 15 out of 4,519 results