Paper Title

Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge

Paper Authors

Peng Wang, Dongyang Liu, Hui Li, Qi Wu

Paper Abstract

Conventional referring expression comprehension (REF) assumes that people query something from an image by describing its visual appearance and spatial location, but in practice we often ask for an object by describing its affordance or other non-visual attributes, especially when we do not have a precise target. For example, sometimes we say 'Give me something to eat'. In this case, we need to use commonsense knowledge to identify the objects in the image. Unfortunately, there is no existing referring expression dataset that reflects this requirement, not to mention a model to tackle this challenge. In this paper, we collect a new referring expression dataset, called KB-Ref, containing 43k expressions on 16k images. In KB-Ref, answering each expression (i.e., detecting the target object referred to by the expression) requires at least one piece of commonsense knowledge. We then test state-of-the-art (SoTA) REF models on KB-Ref and find that all of them show a large performance drop compared to their outstanding results on general REF datasets. We also present an expression-conditioned image and fact attention (ECIFA) network that extracts information from correlated image regions and commonsense knowledge facts. Our method leads to a significant improvement over SoTA REF models, although a gap remains between this strong baseline and human performance. The dataset and baseline models will be released.
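The abstract describes ECIFA only at a high level: attention that pools information from image regions and retrieved commonsense facts, conditioned on the expression. As a rough illustration of that idea, the sketch below attention-pools a fact context under the expression embedding and then scores each candidate region against the fused context. All dimensions, module names, and the additive fusion are assumptions made for this example; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpressionConditionedAttention(nn.Module):
    """Illustrative sketch, not the released ECIFA code: attend over
    retrieved facts conditioned on the expression, then score each
    candidate image region against the fused context."""

    def __init__(self, expr_dim=512, region_dim=2048, fact_dim=300, hidden=512):
        super().__init__()
        # Project expression, region, and fact features into a shared space.
        self.q_proj = nn.Linear(expr_dim, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)
        self.fact_proj = nn.Linear(fact_dim, hidden)

    @staticmethod
    def attend(query, keys):
        # query: (B, H), keys: (B, N, H) -> attention-pooled context (B, H).
        scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)      # (B, N)
        weights = F.softmax(scores, dim=1)
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)

    def forward(self, expr, regions, facts):
        # expr:    (B, expr_dim)      pooled expression embedding (e.g. LSTM state)
        # regions: (B, R, region_dim) features of detected candidate objects
        # facts:   (B, F, fact_dim)   embeddings of retrieved commonsense facts
        q = self.q_proj(expr)                    # (B, H)
        r = self.region_proj(regions)            # (B, R, H)
        f = self.fact_proj(facts)                # (B, F, H)
        fact_ctx = self.attend(q, f)             # expression-guided fact context
        ctx = q + fact_ctx                       # fuse expression with knowledge
        # One logit per candidate region; argmax picks the referred object.
        return torch.bmm(r, ctx.unsqueeze(2)).squeeze(2)             # (B, R)


# Toy usage: batch of 2 images, 5 candidate regions, 8 retrieved facts each.
model = ExpressionConditionedAttention()
logits = model(torch.randn(2, 512), torch.randn(2, 5, 2048), torch.randn(2, 8, 300))
print(logits.argmax(dim=1))  # predicted target-region indices
```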
