Paper Title

Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders

Paper Authors

Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, Stéphane Marchand-Maillet

Paper Abstract

Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences, i.e., image regions and words, respectively, in order to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both the MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links would make it impossible to separately extract the visual and textual features needed for the online search and offline indexing steps of large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way for research into effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain improvements of 5.7% and 3.5% in Recall@1 on the image and sentence retrieval tasks, respectively. The code used for the experiments is publicly available on GitHub at https://github.com/mesnico/TERAN.
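The late-fusion scoring the abstract describes, where region and word embeddings are produced by separate pipelines and merged only in a final alignment step, can be illustrated with a minimal sketch. This is an assumption-laden illustration rather than the paper's exact implementation: the function name, the max-over-regions pooling averaged over words, and the tensor shapes below are all choices made here for clarity; TERAN's precise pooling and training loss are defined in the paper and the linked repository.

```python
import torch
import torch.nn.functional as F

def global_similarity(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """Pool a region-word alignment matrix into one image-sentence score.

    regions: (n_regions, d) embeddings from the visual pipeline
    words:   (n_words, d) embeddings from the textual pipeline
    """
    # Cosine similarity between every region-word pair: this is the
    # only point where the two modalities interact.
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    alignments = regions @ words.t()  # (n_regions, n_words)

    # For each word, keep its best-matching region, then average over
    # words (an assumed max-over-regions pooling, for illustration only).
    return alignments.max(dim=0).values.mean()

# Example: 36 detected regions and a 7-word sentence, 1024-d embeddings.
score = global_similarity(torch.randn(36, 1024), torch.randn(7, 1024))
```

Because the two pipelines stay separate until this lightweight pooling, region and word embeddings can be extracted and indexed offline, with only the final scoring executed at query time, which is what makes the approach suited to large-scale retrieval.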
