Paper Title


Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding

Paper Authors

Qingxing Cao, Bailin Li, Xiaodan Liang, Keze Wang, Liang Lin

Abstract


Though beneficial for encouraging Visual Question Answering (VQA) models to discover the underlying knowledge by exploiting input-output correlations beyond image and text contexts, existing knowledge VQA datasets are mostly annotated in a crowdsourced way, e.g., by collecting questions and external reasons from different users via the internet. In addition to the challenge of knowledge reasoning, how to deal with annotator bias also remains unsolved, and this bias often leads to superficial, over-fitted correlations between questions and answers. To address this issue, we propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation. Considering that a desirable VQA model should correctly perceive the image context, understand the question, and incorporate its learned knowledge, our proposed dataset aims to cut off the shortcut learning exploited by current deep embedding models and push the research boundary of knowledge-based visual question reasoning. Specifically, we generate the question-answer pairs based on both the Visual Genome scene graph and an external knowledge base with controlled programs to disentangle the knowledge from other biases. The programs can select one or two triplets from the scene graph or knowledge base to enable multi-step reasoning, avoid answer ambiguity, and balance the answer distribution. In contrast to existing VQA datasets, we further impose the following two major constraints on the programs to incorporate knowledge reasoning: i) multiple knowledge triplets can be related to the question, but only one knowledge triplet relates to the image object. This forces the VQA model to correctly perceive the image instead of guessing the knowledge based solely on the given question; ii) all questions are based on different knowledge, but the candidate answers are the same for both the training and test sets.
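Constraint i) above can be illustrated with a small sketch. This is not the authors' released generation code; the knowledge base, relation names, and question template below are invented for illustration. The key idea is that a question-answer pair is emitted only when several knowledge triplets match the question but exactly one of them is grounded in the image's scene graph, so the model cannot answer from the question text alone.

```python
# Illustrative sketch (hypothetical data, not the paper's actual pipeline)
# of the grounding constraint: many KB triplets may match a question,
# but exactly one must have its subject present in the image.

# Toy external knowledge base of (subject, relation, object) triplets.
KNOWLEDGE_BASE = [
    ("banana", "is_a", "fruit"),
    ("apple", "is_a", "fruit"),
    ("carrot", "is_a", "vegetable"),
    ("dog", "capable_of", "bark"),
]

def candidate_triplets(relation, obj):
    """All KB triplets whose relation and object match the question."""
    return [t for t in KNOWLEDGE_BASE if t[1] == relation and t[2] == obj]

def grounded_triplets(triplets, scene_objects):
    """Triplets whose subject actually appears in the image scene graph."""
    return [t for t in triplets if t[0] in scene_objects]

def make_question(relation, obj, scene_objects):
    """Emit a QA pair only when multiple triplets match the question text
    but exactly one is grounded in the image, forcing visual perception."""
    matches = candidate_triplets(relation, obj)
    grounded = grounded_triplets(matches, scene_objects)
    if len(matches) > 1 and len(grounded) == 1:
        subject = grounded[0][0]
        question = (f"Which object in the image "
                    f"{relation.replace('_', ' ')} {obj}?")
        return question, subject
    return None  # ambiguous or unanswerable; skip this program

# Image contains a banana and a dog: both "banana" and "apple" satisfy
# "is_a fruit" in the KB, but only "banana" is in the scene graph.
qa = make_question("is_a", "fruit", {"banana", "dog"})
```

Under this toy setup, `make_question` returns the pair `("Which object in the image is a fruit?", "banana")`; if the image contained both a banana and an apple, the program would be discarded as ambiguous.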
