Paper Title
Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets
Paper Authors
Paper Abstract
Ideally, Open-Domain Question Answering models should exhibit a number of competencies, ranging from simply memorizing questions seen at training time, to answering novel question formulations with answers seen during training, to generalizing to completely novel questions with novel answers. However, single aggregated test set scores do not show the full picture of what capabilities models truly have. In this work, we perform a detailed study of the test sets of three popular open-domain benchmark datasets with respect to these competencies. We find that 60-70% of test-time answers are also present somewhere in the training sets. We also find that 30% of test-set questions have a near-duplicate paraphrase in their corresponding training sets. Using these findings, we evaluate a variety of popular open-domain models to obtain greater insight into the extent to which they can actually generalize, and what drives their overall performance. We find that all models perform dramatically worse on questions that cannot be memorized from the training sets, with a mean absolute performance difference of 63% between repeated and non-repeated data. Finally, we show that simple nearest-neighbor models outperform a BART closed-book QA model, further highlighting the role that training set memorization plays in these benchmarks.
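To make the abstract's two measurements concrete, below is a minimal sketch of (1) computing what fraction of test answers also appear in the training set, and (2) a nearest-neighbor QA baseline that answers a test question with the answer of its most similar training question. This is not the authors' pipeline: the toy train/test data, the `normalize` helper, and the choice of scikit-learn TF-IDF cosine similarity are illustrative assumptions.

```python
# A minimal sketch, assuming a toy dataset and TF-IDF question similarity;
# the paper's actual overlap annotation and baselines may differ.
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles (standard open-domain QA answer normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

# Hypothetical toy splits standing in for benchmarks like Natural Questions or TriviaQA.
train = [("who wrote hamlet?", "William Shakespeare"),
         ("what is the capital of france?", "Paris")]
test = [("which playwright wrote hamlet?", "William Shakespeare"),
        ("who painted the mona lisa?", "Leonardo da Vinci")]

# (1) Answer overlap: fraction of test answers that also occur somewhere in training.
train_answers = {normalize(a) for _, a in train}
overlap = sum(normalize(a) in train_answers for _, a in test) / len(test)
print(f"test answers seen in training: {overlap:.0%}")  # 50% on this toy data

# (2) Nearest-neighbor QA: return the answer of the most similar training question.
vectorizer = TfidfVectorizer().fit([q for q, _ in train] + [q for q, _ in test])
train_vecs = vectorizer.transform([q for q, _ in train])
for question, gold in test:
    sims = cosine_similarity(vectorizer.transform([question]), train_vecs)[0]
    prediction = train[sims.argmax()][1]
    print(f"{question!r} -> {prediction!r} (gold: {gold!r})")
```

On this toy data the nearest-neighbor baseline answers the paraphrased Hamlet question correctly and fails on the unseen Mona Lisa question, which mirrors the abstract's point: retrieval from the training set alone can score well on repeated test items while telling us nothing about generalization to novel questions and answers.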