Paper Title
A Novel Attention-based Aggregation Function to Combine Vision and Language
Paper Authors
Paper Abstract
The joint understanding of vision and language has recently been gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements -- like regions and words -- proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality, employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, on the COCO and VQA 2.0 datasets, building fair comparisons with other reduction choices. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
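To make the described reduction concrete, below is a minimal PyTorch sketch of an attention-based cross-modal reduction in the spirit of the abstract: each element of each modality receives a score computed through cross-attention, and those scores drive a learnable weighted pooling into a single vector per modality. The module name, projection layout, and score heads are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of an attention-based cross-modal reduction.
# Illustrative only: all names and layer choices are assumptions,
# not the paper's exact architecture.
import torch
import torch.nn as nn

class CrossModalAttentionReduction(nn.Module):
    """Reduces a set of region features and a set of word features to
    single vectors, weighting each element by scores derived from a
    cross-attention between the two modalities."""

    def __init__(self, d_model: int):
        super().__init__()
        # Projections used to compare elements across modalities.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        # Learnable score heads that turn attended features into
        # per-element reduction weights.
        self.vis_score = nn.Linear(d_model, 1)
        self.txt_score = nn.Linear(d_model, 1)

    def forward(self, regions: torch.Tensor, words: torch.Tensor):
        # regions: (B, Nr, d) image region features
        # words:   (B, Nw, d) word features
        d = regions.size(-1)
        # Cross-attention: regions attend over words, and vice versa.
        attn_v2t = torch.softmax(
            self.q_proj(regions) @ self.k_proj(words).transpose(1, 2) / d ** 0.5,
            dim=-1)                          # (B, Nr, Nw)
        attn_t2v = torch.softmax(
            self.q_proj(words) @ self.k_proj(regions).transpose(1, 2) / d ** 0.5,
            dim=-1)                          # (B, Nw, Nr)
        regions_ctx = attn_v2t @ words       # (B, Nr, d) text-aware region features
        words_ctx = attn_t2v @ regions       # (B, Nw, d) image-aware word features
        # Per-element scores -> learnable weighted reduction to one vector.
        w_v = torch.softmax(self.vis_score(regions_ctx), dim=1)  # (B, Nr, 1)
        w_t = torch.softmax(self.txt_score(words_ctx), dim=1)    # (B, Nw, 1)
        v = (w_v * regions).sum(dim=1)       # (B, d) pooled visual vector
        t = (w_t * words).sum(dim=1)         # (B, d) pooled textual vector
        return v, t
```

The pooled vectors `v` and `t` could then feed a similarity score (e.g., cosine) for image-text ranking, or be fused and passed to a classifier head for visual question answering, matching the two uses the abstract mentions.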