Paper Title
Visual Spatial Reasoning
Paper Authors
Paper Abstract
Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English (such as under, in front of, and facing). While the annotation format is seemingly simple, we show how the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%. We observe that VLMs' per-relation performance has little correlation with the number of training examples, and that the tested models are in general incapable of recognising relations concerning the orientations of objects.
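To make the text-image pair format and the by-relation analysis concrete, here is a minimal Python sketch of what a VSR-style instance might look like and how per-relation accuracy could be tallied. The field names (image, caption, relation, label) and the example records are illustrative assumptions, not the dataset's exact schema.

```python
from collections import defaultdict

# Hypothetical VSR-style instances: each pairs an image with a caption that
# asserts one spatial relation, plus a True/False label saying whether the
# caption correctly describes the image. Field names are illustrative only.
examples = [
    {"image": "000000123.jpg", "caption": "The cat is under the table.",
     "relation": "under", "label": True},
    {"image": "000000456.jpg", "caption": "The dog is facing the camera.",
     "relation": "facing", "label": False},
]

def per_relation_accuracy(examples, predictions):
    """Group binary predictions by relation and report accuracy per relation,
    mirroring the kind of by-relation breakdown discussed in the abstract."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        total[ex["relation"]] += 1
        correct[ex["relation"]] += int(pred == ex["label"])
    return {rel: correct[rel] / total[rel] for rel in total}

# Usage with dummy predictions (e.g. a model that always answers True):
print(per_relation_accuracy(examples, [True, True]))
# {'under': 1.0, 'facing': 0.0}
```

Aggregating accuracy per relation in this way is what allows comparing, for instance, topological relations (such as "under") against orientation-dependent ones (such as "facing"), which the abstract identifies as a particular weakness of the tested models.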