Paper Title
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
Authors
Abstract
Visual appearance is considered the most important cue for understanding images in cross-modal retrieval, while the scene text that sometimes appears in images can provide valuable information for understanding the visual semantics. Most existing cross-modal retrieval approaches ignore scene text information, and directly adding this information may lead to performance degradation in scene text free scenarios. To address this issue, we propose a full transformer architecture to unify these cross-modal retrieval scenarios in a single $\textbf{Vi}$sion and $\textbf{S}$cene $\textbf{T}$ext $\textbf{A}$ggregation framework (ViSTA). Specifically, ViSTA utilizes transformer blocks to directly encode image patches and fuse scene text embeddings to learn an aggregated visual representation for cross-modal retrieval. To tackle the modality missing problem of scene text, we propose a novel fusion-token-based transformer aggregation approach that exchanges the necessary scene text information only through the fusion token and concentrates on the most important features in each modality. To further strengthen the visual modality, we develop dual contrastive learning losses to embed both image-text pairs and fusion-text pairs into a common cross-modal space. Compared to existing methods, ViSTA can aggregate relevant scene text semantics with visual appearance, and hence improves results under both scene text free and scene text aware scenarios. Experimental results show that ViSTA outperforms other methods by at least $\bf{8.4}\%$ at Recall@1 on the scene text aware retrieval task. Compared with state-of-the-art scene text free retrieval methods, ViSTA achieves better accuracy on Flickr30K and MSCOCO while running at least three times faster during inference, which validates the effectiveness of the proposed framework.
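To make the dual contrastive learning objective concrete, below is a minimal sketch assuming a symmetric InfoNCE formulation with temperature-scaled cosine similarity. The function names (`contrastive_loss`, `dual_contrastive_loss`), the temperature value, and the way the image, fusion, and text embeddings are summed are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of a dual contrastive objective (image-text + fusion-text),
# assuming symmetric InfoNCE; illustrative only, not the authors' code.
import torch
import torch.nn.functional as F

def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings a, b of shape (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs lie on the diagonal
    # cross-entropy in both retrieval directions: a -> b and b -> a
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def dual_contrastive_loss(img_emb: torch.Tensor,
                          fusion_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Sum of the image-text and fusion-text contrastive terms."""
    return (contrastive_loss(img_emb, txt_emb, temperature)
            + contrastive_loss(fusion_emb, txt_emb, temperature))

if __name__ == "__main__":
    # Toy usage with random embeddings: batch of 8, embedding dimension 256.
    B, D = 8, 256
    loss = dual_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

Each term aligns one visual-side embedding (image-only or scene-text-aware fusion) with the same text embedding, which matches the abstract's motivation: the image branch alone stays useful when scene text is absent, while the fusion branch benefits when scene text is present.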