Paper Title
ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning
Paper Authors
Paper Abstract
Given questions regarding some prototypical situation, such as "Name something that people usually do before they leave the house for work?", a human can easily answer them via acquired experiences. There can be multiple right answers for such questions, with some more common for a situation than others. This paper introduces a new question answering dataset for training and evaluating the common-sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played on the long-running international game show FAMILY FEUD. The hidden evaluation set is created by gathering answers to each question from 100 crowd workers. We also propose a generative evaluation task in which a model must output a ranked list of answers, ideally covering all prototypical answers for a question. After presenting multiple competitive baseline models, we find that human performance still exceeds model scores on all evaluation metrics by a meaningful gap, supporting the challenging nature of the task.
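
The abstract describes the generative evaluation only at a high level. As a concrete illustration, the sketch below scores a model's ranked answer list against crowd-sourced answer clusters. The cluster labels and counts, the fuzzy string-matching rule, and the score_ranked_answers helper are illustrative assumptions, not the paper's official metric or implementation.

```python
from difflib import SequenceMatcher

def fuzzy_match(pred, gold, threshold=0.8):
    """Loose string match between a predicted answer and a gold cluster label (illustrative rule)."""
    return SequenceMatcher(None, pred.lower().strip(), gold.lower().strip()).ratio() >= threshold

def score_ranked_answers(ranked_answers, answer_clusters, k=10):
    """Fraction of the best achievable crowd count recovered by the top-k predictions.

    ranked_answers: model's ranked list of answer strings.
    answer_clusters: dict mapping a representative gold answer to the number of
                     crowd workers (out of 100) whose answer fell into that cluster.
    Each cluster can be credited at most once.
    """
    remaining = dict(answer_clusters)          # clusters not yet matched
    earned = 0
    for pred in ranked_answers[:k]:
        for gold, count in list(remaining.items()):
            if fuzzy_match(pred, gold):
                earned += count
                del remaining[gold]            # credit each cluster only once
                break
    best_possible = sum(sorted(answer_clusters.values(), reverse=True)[:k])
    return earned / best_possible if best_possible else 0.0

# Hypothetical clusters for "Name something that people usually do before they
# leave the house for work?" (counts are made up for illustration).
clusters = {"lock the door": 32, "eat breakfast": 24, "take a shower": 18,
            "get dressed": 15, "grab keys": 11}
predictions = ["lock the doors", "eat breakfast", "brush teeth", "get dressed"]
print(round(score_ranked_answers(predictions, clusters, k=4), 2))
```

In the dataset itself, the gold clusters would come from the 100 crowd-worker answers mentioned above; the matching and normalization choices here are placeholders for whatever the paper's evaluation actually specifies.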