Paper Title
Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks
Paper Authors
Paper Abstract
Attention models are widely used in Vision-Language (V-L) tasks to perform visual-textual correlation. Humans perform such correlation with a strong linguistic understanding of the visual world. However, even the best-performing attention models in V-L tasks lack such high-level linguistic understanding, creating a semantic gap between the modalities. In this paper, we propose an attention mechanism, Linguistically-aware Attention (LAT), that leverages object attributes obtained from generic object detectors along with pre-trained language models to reduce this semantic gap. LAT represents the visual and textual modalities in a common linguistically-rich space, thus providing linguistic awareness to the attention process. We apply LAT and demonstrate its effectiveness in three V-L tasks: Counting-VQA, VQA, and image captioning. In Counting-VQA, we propose a novel counting-specific VQA model to predict an intuitive count and achieve state-of-the-art results on five datasets. In VQA and captioning, we show the generic nature and effectiveness of LAT by adapting it into various baselines and consistently improving their performance.
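Since the abstract only outlines the mechanism, the following is a minimal sketch (not the authors' implementation) of how such linguistically-aware attention could look: attribute labels predicted by a generic object detector are embedded with pre-trained word embeddings, fused with the region's visual features in a common space, and attended against a text query in that same space. The class name, layer structure, dimensions (300/2048/512), and the additive tanh fusion are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinguisticallyAwareAttention(nn.Module):
    """Sketch of attention computed in a shared, linguistically-rich space.

    Hypothetical design: detector region features and pre-trained embeddings
    of their attribute labels are projected into one common space, where a
    text query (question/caption context) scores each region.
    """

    def __init__(self, word_dim=300, vis_dim=2048, common_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, common_dim)    # visual features -> common space
        self.attr_proj = nn.Linear(word_dim, common_dim)  # attribute embeddings -> common space
        self.txt_proj = nn.Linear(word_dim, common_dim)   # text query -> common space

    def forward(self, vis_feats, attr_embs, txt_emb):
        # vis_feats: (B, R, vis_dim)  region features from a generic detector
        # attr_embs: (B, R, word_dim) pre-trained embeddings of attribute labels
        # txt_emb:   (B, word_dim)    pooled question/caption embedding
        regions = torch.tanh(self.vis_proj(vis_feats) + self.attr_proj(attr_embs))
        query = torch.tanh(self.txt_proj(txt_emb)).unsqueeze(1)        # (B, 1, D)
        scores = (regions * query).sum(dim=-1)                         # (B, R)
        alpha = F.softmax(scores, dim=-1)                              # attention over regions
        attended = torch.bmm(alpha.unsqueeze(1), regions).squeeze(1)   # (B, D)
        return attended, alpha
```

In a VQA or captioning baseline, the `attended` vector would stand in for the purely visual attended feature fed to the answer or word predictor; the key point the abstract makes is that both the regions and the query live in the same language-grounded space when the weights are computed.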