Paper Title
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
Paper Authors
Paper Abstract
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed for the development of AI models that can jointly reason over visual and natural language inputs. Despite a proliferation of VQA datasets, this goal is hindered by a set of common limitations. These include a reliance on relatively simplistic questions that are repetitive in both concepts and linguistic structure, little world knowledge needed outside of the paired image, and limited reasoning required to arrive at the correct answer. We introduce A-OKVQA, a crowdsourced dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer. In contrast to the existing knowledge-based VQA datasets, the questions generally cannot be answered by simply querying a knowledge base, and instead require some form of commonsense reasoning about the scene depicted in the image. We demonstrate the potential of this new dataset through a detailed analysis of its contents and baseline performance measurements over a variety of state-of-the-art vision-language models. Project page: http://a-okvqa.allenai.org/
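To make the dataset's evaluation setup concrete, below is a minimal sketch of loading an A-OKVQA split and scoring a multiple-choice baseline. The filename and record fields (`question_id`, `choices`, `correct_choice_idx`) are assumptions about the public JSON release, not details stated in the abstract; treat this as an illustration rather than the official evaluation code.

```python
# Hypothetical sketch: the split filename and record fields
# (question_id, choices, correct_choice_idx) are assumed, not
# confirmed by the abstract itself.
import json


def load_split(path: str) -> list[dict]:
    """Load one A-OKVQA split, assumed to be a JSON list of question records."""
    with open(path) as f:
        return json.load(f)


def multiple_choice_accuracy(records: list[dict], predictions: dict[str, int]) -> float:
    """Fraction of questions whose predicted choice index matches the
    annotated correct choice. `predictions` maps question_id -> choice index."""
    correct = sum(
        1 for r in records
        if predictions.get(r["question_id"]) == r["correct_choice_idx"]
    )
    return correct / len(records)


if __name__ == "__main__":
    records = load_split("aokvqa_v1p0_val.json")  # assumed filename
    # Trivial baseline: always pick the first of the answer choices.
    preds = {r["question_id"]: 0 for r in records}
    print(f"Multiple-choice accuracy: {multiple_choice_accuracy(records, preds):.3f}")
```

With four answer choices per question, such a constant baseline should land near chance (about 0.25), which is the kind of floor the paper's baseline measurements over vision-language models are compared against.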