Paper Title

Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog

Paper Authors

Shunyu Zhang, Xiaoze Jiang, Zequn Yang, Tao Wan, Zengchang Qin

Paper Abstract

Visual Dialog requires an agent to engage in a conversation with humans grounded in an image. Many studies on Visual Dialog focus on understanding the dialog history or the image content, while the considerable number of questions that require commonsense knowledge is ignored. Handling these scenarios depends on logical reasoning that requires commonsense priors. How to capture relevant commonsense knowledge complementary to the history and the image remains a key challenge. In this paper, we propose a novel model, Reasoning with Multi-structure Commonsense Knowledge (RMK). In our model, external knowledge is represented with sentence-level facts and graph-level facts, to properly suit the composite scenario of dialog history and image. On top of these multi-structure representations, our model can capture relevant knowledge and incorporate it into the vision and semantic features via graph-based interaction and transformer-based fusion. Experimental results and analysis on the VisDial v1.0 and VisDialCK datasets show that our proposed model effectively outperforms comparable methods.
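
The abstract names two fusion mechanisms: graph-based interaction between visual features and graph-level facts, and transformer-based fusion of dialog history with sentence-level facts. The PyTorch sketch below illustrates one plausible reading of that design; it is not the authors' RMK implementation, and every class name, tensor shape, and hyperparameter here is a hypothetical choice made for illustration.

```python
# Minimal sketch of the two knowledge-fusion paths described in the abstract.
# NOT the authors' code: all names, shapes, and hyperparameters are assumptions.
import torch
import torch.nn as nn


class GraphKnowledgeInteraction(nn.Module):
    """One round of attention-weighted message passing in which image region
    features attend over graph-level fact (node) embeddings."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, regions: torch.Tensor, fact_nodes: torch.Tensor) -> torch.Tensor:
        # regions: (B, R, D) visual region features; fact_nodes: (B, N, D) fact nodes.
        attn = torch.softmax(
            self.query(regions) @ self.key(fact_nodes).transpose(1, 2)
            / regions.size(-1) ** 0.5,
            dim=-1,
        )
        # Residual update: each region absorbs the graph facts it attends to.
        return regions + attn @ self.value(fact_nodes)


class TransformerKnowledgeFusion(nn.Module):
    """Cross-attention fusion of dialog-history tokens with sentence-level
    fact embeddings, followed by a standard transformer feed-forward block."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, history: torch.Tensor, fact_sents: torch.Tensor) -> torch.Tensor:
        # history: (B, T, D) dialog tokens; fact_sents: (B, S, D) fact sentences.
        fused, _ = self.cross_attn(history, fact_sents, fact_sents)
        history = self.norm1(history + fused)
        return self.norm2(history + self.ffn(history))


if __name__ == "__main__":
    B, R, N, T, S, D = 2, 36, 10, 20, 5, 512
    regions, fact_nodes = torch.randn(B, R, D), torch.randn(B, N, D)
    history, fact_sents = torch.randn(B, T, D), torch.randn(B, S, D)
    print(GraphKnowledgeInteraction(D)(regions, fact_nodes).shape)   # (2, 36, 512)
    print(TransformerKnowledgeFusion(D)(history, fact_sents).shape)  # (2, 20, 512)
```

The sketch keeps the two knowledge structures separate, mirroring the abstract's claim that graph-level facts pair naturally with the image while sentence-level facts pair with the dialog history; how RMK actually wires these outputs into answer ranking is not specified in the abstract.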
