Paper Title


KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning

Paper Authors

Song, Dandan, Ma, Siyi, Sun, Zhanchen, Yang, Sicheng, Liao, Lejian

Paper Abstract


Reasoning is a critical ability towards complete visual understanding. To develop machines with cognition-level visual understanding and reasoning abilities, the visual commonsense reasoning (VCR) task has been introduced. In VCR, given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Methods adopting the powerful BERT model as the backbone for learning joint representations of image content and natural language have shown promising improvements on VCR. However, none of the existing methods utilize commonsense knowledge in visual commonsense reasoning, which we believe would be greatly helpful for this task. With the support of commonsense knowledge, complex questions can be answered through cognitive reasoning even when the required information is not depicted in the image. Therefore, we incorporate commonsense knowledge into the cross-modal BERT and propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model. Besides taking visual and linguistic contents as input, external commonsense knowledge extracted from ConceptNet is integrated into the multi-layer Transformer. To preserve the structural information and semantic representation of the original sentence, we propose using relative position embedding and mask-self-attention to weaken the effect between the injected commonsense knowledge and other unrelated components in the input sequence. Compared to other task-specific models and general task-agnostic pre-training models, our KVL-BERT outperforms them by a large margin.
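The mask-self-attention idea described above can be illustrated with a small sketch. The snippet below is not the authors' implementation; it is a minimal NumPy illustration (function name `build_visible_matrix` and the span layout are our own assumptions) of a visibility matrix in which injected knowledge tokens attend only to their own span and the entity token they were retrieved for, so they do not disturb unrelated parts of the input sequence:

```python
import numpy as np

def build_visible_matrix(num_tokens, knowledge_spans):
    """Illustrative mask-self-attention visibility matrix (assumption).

    knowledge_spans: list of (anchor_idx, [injected token indices]),
    where the injected tokens carry commonsense knowledge retrieved
    for the anchor token. Original tokens remain mutually visible;
    each injected span is visible only to itself and its anchor.
    """
    injected = {i for _, span in knowledge_spans for i in span}
    original = [i for i in range(num_tokens) if i not in injected]

    visible = np.zeros((num_tokens, num_tokens), dtype=bool)

    # Original sentence tokens attend to one another as usual.
    for i in original:
        for j in original:
            visible[i, j] = True

    # Each knowledge span sees itself and its anchor (both directions),
    # but nothing else in the sequence.
    for anchor, span in knowledge_spans:
        for i in span:
            for j in span:
                visible[i, j] = True
            visible[i, anchor] = True
            visible[anchor, i] = True
    return visible

# Toy sequence of 6 tokens; tokens 4 and 5 are knowledge injected
# for the entity at position 2.
mask = build_visible_matrix(6, [(2, [4, 5])])
```

In a Transformer layer, such a matrix would be applied by setting attention logits to negative infinity wherever the entry is `False`; relative position embeddings (giving injected tokens positions relative to their anchor) would complement this so the original sentence's positional structure is preserved.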
