Paper Title


Point and Ask: Incorporating Pointing into Visual Question Answering

Paper Authors

Arjun Mani, Nobline Yoo, Will Hinthorn, Olga Russakovsky

Paper Abstract


Visual Question Answering (VQA) has become one of the key benchmarks of visual recognition progress. Multiple VQA extensions have been explored to better simulate real-world settings: different question formulations, changing training and test distributions, conversational consistency in dialogues, and explanation-based answering. In this work, we further expand this space by considering visual questions that include a spatial point of reference. Pointing is a nearly universal gesture among humans, and real-world VQA is likely to involve a gesture towards the target region. Concretely, we (1) introduce and motivate point-input questions as an extension of VQA, (2) define three novel classes of questions within this space, and (3) for each class, introduce both a benchmark dataset and a series of baseline models to handle its unique challenges. There are two key distinctions from prior work. First, we explicitly design the benchmarks to require the point input, i.e., we ensure that the visual question cannot be answered accurately without the spatial reference. Second, we explicitly explore the more realistic point spatial input rather than the standard but unnatural bounding box input. Through our exploration we uncover and address several visual recognition challenges, including the ability to infer human intent, reason both locally and globally about the image, and effectively combine visual, language and spatial inputs. Code is available at: https://github.com/princetonvisualai/pointingqa.
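The abstract describes combining visual, language, and spatial (point) inputs into a single model. Below is a minimal, hypothetical sketch of what such a three-input fusion baseline could look like in PyTorch. The class name `PointVQABaseline`, all layer sizes, and the coordinate-based point encoding are illustrative assumptions for exposition only; they are not the architectures or code from the paper's repository.

```python
import torch
import torch.nn as nn

class PointVQABaseline(nn.Module):
    """Illustrative point-input VQA baseline (not the paper's model).

    Fuses three inputs: an image, a tokenized question, and a single
    spatial point given as normalized (x, y) coordinates.
    """
    def __init__(self, vocab_size=10000, num_answers=1000, hidden=512):
        super().__init__()
        # Visual encoder: a tiny CNN standing in for a pretrained backbone.
        self.visual = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Question encoder: word embeddings followed by a GRU.
        self.embed = nn.Embedding(vocab_size, 300)
        self.gru = nn.GRU(300, hidden, batch_first=True)
        # Point encoder: normalized (x, y) -> hidden-dimensional vector.
        self.point = nn.Sequential(nn.Linear(2, hidden), nn.ReLU())
        # Fusion by concatenation, then classification over candidate answers.
        self.classifier = nn.Sequential(
            nn.Linear(hidden * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, image, question_ids, point_xy):
        v = self.visual(image)                     # (B, hidden)
        _, h = self.gru(self.embed(question_ids))  # h: (1, B, hidden)
        q = h.squeeze(0)                           # (B, hidden)
        p = self.point(point_xy)                   # (B, hidden)
        return self.classifier(torch.cat([v, q, p], dim=1))  # answer logits


# Toy forward pass: 2 images, 8-token questions, one point per image.
model = PointVQABaseline()
img = torch.randn(2, 3, 224, 224)
q = torch.randint(0, 10000, (2, 8))
pt = torch.rand(2, 2)  # (x, y) in [0, 1]
print(model(img, q, pt).shape)  # torch.Size([2, 1000])
```

The concatenation-based fusion here is only one simple way to combine the three modalities; encoding the point as a spatial heatmap aligned with the image features is another common design choice.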
