Paper Title
Visual Spatial Reasoning
Paper Authors
Paper Abstract
Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English (such as under, in front of, and facing). While the annotation format is seemingly simple, we show how the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%. We observe that VLMs' per-relation performance has little correlation with the number of training examples, and that the tested models are in general incapable of recognising relations concerning the orientations of objects.
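To make the text-image pair format and the by-relation analysis concrete, here is a minimal Python sketch of what a VSR-style instance might look like and how per-relation accuracy could be tallied. The field names (image, caption, relation, label) and the example records are illustrative assumptions, not the dataset's exact schema.

```python
from collections import defaultdict

# Hypothetical VSR-style instances: each pairs an image with a caption that
# asserts one spatial relation, plus a True/False label saying whether the
# caption correctly describes the image. Field names are illustrative only.
examples = [
    {"image": "000000123.jpg", "caption": "The cat is under the table.",
     "relation": "under", "label": True},
    {"image": "000000456.jpg", "caption": "The dog is facing the camera.",
     "relation": "facing", "label": False},
]

def per_relation_accuracy(examples, predictions):
    """Group binary predictions by relation and report accuracy per relation,
    mirroring the kind of by-relation breakdown discussed in the abstract."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        total[ex["relation"]] += 1
        correct[ex["relation"]] += int(pred == ex["label"])
    return {rel: correct[rel] / total[rel] for rel in total}

# Usage with dummy predictions (e.g. a model that always answers True):
print(per_relation_accuracy(examples, [True, True]))
# {'under': 1.0, 'facing': 0.0}
```

Aggregating accuracy per relation in this way is what allows comparing, for instance, topological relations (such as "under") against orientation-dependent ones (such as "facing"), which the abstract identifies as a particular weakness of the tested models.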