Paper Title
A Novel Attention-based Aggregation Function to Combine Vision and Language
Paper Authors
Paper Abstract
The joint understanding of vision and language has recently been gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements -- like regions and words -- proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality, employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, on the COCO and VQA 2.0 datasets, building fair comparisons with other reduction choices. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
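To make the described reduction concrete, below is a minimal PyTorch sketch of an attention-based cross-modal reduction in the spirit of the abstract: each element of each modality receives a score computed through cross-attention, and those scores drive a learnable weighted pooling into a single vector per modality. The module name, projection layout, and score heads are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of an attention-based cross-modal reduction.
# Illustrative only: all names and layer choices are assumptions,
# not the paper's exact architecture.
import torch
import torch.nn as nn

class CrossModalAttentionReduction(nn.Module):
    """Reduces a set of region features and a set of word features to
    single vectors, weighting each element by scores derived from a
    cross-attention between the two modalities."""

    def __init__(self, d_model: int):
        super().__init__()
        # Projections used to compare elements across modalities.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        # Learnable score heads that turn attended features into
        # per-element reduction weights.
        self.vis_score = nn.Linear(d_model, 1)
        self.txt_score = nn.Linear(d_model, 1)

    def forward(self, regions: torch.Tensor, words: torch.Tensor):
        # regions: (B, Nr, d) image region features
        # words:   (B, Nw, d) word features
        d = regions.size(-1)
        # Cross-attention: regions attend over words, and vice versa.
        attn_v2t = torch.softmax(
            self.q_proj(regions) @ self.k_proj(words).transpose(1, 2) / d ** 0.5,
            dim=-1)                          # (B, Nr, Nw)
        attn_t2v = torch.softmax(
            self.q_proj(words) @ self.k_proj(regions).transpose(1, 2) / d ** 0.5,
            dim=-1)                          # (B, Nw, Nr)
        regions_ctx = attn_v2t @ words       # (B, Nr, d) text-aware region features
        words_ctx = attn_t2v @ regions       # (B, Nw, d) image-aware word features
        # Per-element scores -> learnable weighted reduction to one vector.
        w_v = torch.softmax(self.vis_score(regions_ctx), dim=1)  # (B, Nr, 1)
        w_t = torch.softmax(self.txt_score(words_ctx), dim=1)    # (B, Nw, 1)
        v = (w_v * regions).sum(dim=1)       # (B, d) pooled visual vector
        t = (w_t * words).sum(dim=1)         # (B, d) pooled textual vector
        return v, t
```

The pooled vectors `v` and `t` could then feed a similarity score (e.g., cosine) for image-text ranking, or be fused and passed to a classifier head for visual question answering, matching the two uses the abstract mentions.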