Paper Title

What do Models Learn from Question Answering Datasets?

Paper Authors

Priyanka Sen, Amir Saffari

Paper Abstract

While models have reached superhuman performance on popular question answering (QA) datasets such as SQuAD, they have yet to outperform humans on the task of question answering itself. In this paper, we investigate if models are learning reading comprehension from QA datasets by evaluating BERT-based models across five datasets. We evaluate models on their generalizability to out-of-domain examples, responses to missing or incorrect data, and ability to handle question variations. We find that no single dataset is robust to all of our experiments and identify shortcomings in both datasets and evaluation methods. Following our analysis, we make recommendations for building future QA datasets that better evaluate the task of question answering through reading comprehension. We also release code to convert QA datasets to a shared format for easier experimentation at https://github.com/amazon-research/qa-dataset-converter.
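
The abstract mentions released code that converts QA datasets into a shared format for easier experimentation. As a rough illustration only, the sketch below flattens SQuAD-style JSON into a simple list of question/context/answer records; the field names and output schema here are assumptions for illustration and need not match the actual qa-dataset-converter implementation.

import json

def squad_to_shared_format(squad_path):
    """Flatten nested SQuAD-style JSON into flat QA records (hypothetical schema)."""
    with open(squad_path, encoding="utf-8") as f:
        data = json.load(f)["data"]
    records = []
    for article in data:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                records.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    # Gold answer texts; empty for unanswerable questions.
                    "answers": [a["text"] for a in qa.get("answers", [])],
                })
    return records

A flat record list like this makes it straightforward to run the same model and evaluation code across multiple datasets, which is the kind of cross-dataset experimentation the paper describes.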
