China University of Mining and Technology Homepage Platform: Zhao Zuopeng, Chinese Homepage: Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning

Zhao Zuopeng (赵作鹏) calvin

Associate Professor*, Master's Supervisor

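Email:

Date of employment: 2005-07-01

Affiliation: School of Computer Science and Technology

Education: Doctoral graduate

Office: Computer Building A315-1, B610

Employment status: Active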

Publications


Masking-Based Cross-Modal Remote Sensing Image–Text Retrieval via Dynamic Contrastive Learning

Posted: 2024-07-21

Impact factor: 8.2
Affiliation: China University of Mining and Technology
Journal: IEEE Transactions on Geoscience and Remote Sensing
Keywords: Cross-modal remote-sensing image–text retrieval (CMRSITR), masked image modeling (MIM), masked language modeling (MLM), momentum contrast
Abstract: Cross-modal remote sensing image–text retrieval (CMRSITR) aims to extract comprehensive information from diverse modalities. The primary challenge in this field is developing effective mappings between visual and textual modalities to a shared latent space. Existing approaches generally focus on utilizing pretrained unimodal models to independently extract features from each modality. However, these techniques often fall short of achieving the critical alignment necessary for effective cross-modal matching. These techniques predominantly concentrate on the extraction of features and alignment at an instance level, suggesting potential areas for enhancement. To address these limitations, we introduce the masked interaction inferring and aligning (MIIA) framework, utilizing dynamic contrastive learning (DCL). This framework is adept at discerning the intricate relationships between local visual–textual tokens, thereby significantly bolstering the congruence of global image–text pairings without relying on additional prior supervision. Initially, we devise a masked interaction inferring (MII) module, which fosters token-level interplays through a novel masked visual-language (VL) modeling approach. Following this, we implement a cross-modal DCL mechanism, which is instrumental in capturing and aligning semantic correlations between images and texts more effectively. Finally, to ensure the comprehensive matching of visual and textual embeddings, we introduce a unique technique known as bidirectional distribution matching (BDM). This method is designed to minimize the Kullback–Leibler (KL) divergence between the distributions of image–text similarity, computed using the negative queues in momentum contrast learning. Comprehensive experiments performed on well-established public datasets consistently validate the state-of-the-art performance of MIIA methods in the CMRSITR task.
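Paper type: Journal article
Article number: 5626215
Discipline category: Engineering
First-level discipline: Computer Science and Technology
Document type: J
Volume: 62
Issue: 2024
Translated work: No
Publication date: 2024-06-21
Indexed in: SCI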
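
The abstract above describes bidirectional distribution matching (BDM) only at a high level: a KL divergence between image–text similarity distributions, computed with negative queues. The following is a minimal sketch of that general idea, not the paper's formulation. It assumes the predicted distribution is a temperature-scaled softmax over cosine similarities, that the target distribution places all mass on the paired item, and that the divergence is taken as KL(predicted || target) in both retrieval directions. All function names, the temperature value, and the `eps` smoothing term are hypothetical choices made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores, temperature):
    """Temperature-scaled softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q); eps (an assumed smoothing term) guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def one_direction_loss(queries, positives, negative_queue, temperature):
    """Mean KL between predicted and label distributions for one direction."""
    # Candidates are the in-batch items followed by the queued negatives.
    candidates = positives + negative_queue
    total = 0.0
    for i, query in enumerate(queries):
        sims = [sum(a * b for a, b in zip(query, c)) for c in candidates]
        predicted = softmax(sims, temperature)
        # Assumed target: all mass on the paired item at the same index.
        target = [1.0 if j == i else 0.0 for j in range(len(candidates))]
        total += kl_divergence(predicted, target)
    return total / len(queries)

def bidirectional_matching_loss(images, texts, text_queue, image_queue,
                                temperature=0.07):
    """Sum of image-to-text and text-to-image matching losses."""
    images = [normalize(v) for v in images]
    texts = [normalize(v) for v in texts]
    text_queue = [normalize(v) for v in text_queue]
    image_queue = [normalize(v) for v in image_queue]
    i2t = one_direction_loss(images, texts, text_queue, temperature)
    t2i = one_direction_loss(texts, images, image_queue, temperature)
    return i2t + t2i
```

With toy embeddings, a correctly paired batch should yield a loss near zero, while swapping the text order should raise it sharply, because the predicted mass then falls on candidates that the target marks as non-matching.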
