Paper Title


What Can Secondary Predictions Tell Us? An Exploration on Question-Answering with SQuAD-v2.0

Authors

Michael Kamfonas, Gabriel Alon

Abstract


Performance in natural language processing, and specifically for the question-answer task, is typically measured by comparing a model's most confident (primary) prediction to golden answers (the ground truth). We are making the case that it is also useful to quantify how close a model came to predicting a correct answer even for examples that failed. We define the Golden Rank (GR) of an example as the rank of its most confident prediction that exactly matches a ground truth, and show why such a match always exists. For the 16 transformer models we analyzed, the majority of exactly matched golden answers in secondary prediction space hover very close to the top rank. We refer to secondary predictions as those ranking above 0 in descending confidence probability order. We demonstrate how the GR can be used to classify questions and visualize their spectrum of difficulty, from persistent near successes to persistent extreme failures. We derive a new aggregate statistic over entire test sets, named the Golden Rank Interpolated Median (GRIM) that quantifies the proximity of failed predictions to the top choice made by the model. To develop some intuition and explore the applicability of these metrics we use the Stanford Question Answering Dataset (SQuAD-2) and a few popular transformer models from the Hugging Face hub. We first demonstrate that the GRIM is not directly correlated with the F1 and exact match (EM) scores. We then calculate and visualize these scores for various transformer architectures, probe their applicability in error analysis by clustering failed predictions, and compare how they relate to other training diagnostics such as the EM and F1 scores. We finally suggest various research goals, such as broadening data collection for these metrics and their possible use in adversarial training.
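The two metrics defined in the abstract could be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names are hypothetical, and the interpolated-median formula (spreading each integer rank uniformly over a unit-width bin) is an assumption about how "interpolated median" is computed.

```python
def golden_rank(predictions, golden_answers):
    """Golden Rank (GR): 0-based rank of the most confident prediction
    that exactly matches any ground-truth (golden) answer.

    `predictions` is assumed to be a list of answer strings ordered by
    descending model confidence (rank 0 = primary prediction)."""
    for rank, pred in enumerate(predictions):
        if pred in golden_answers:
            return rank
    return None  # in the paper's setting, a match always exists


def grim(ranks):
    """Golden Rank Interpolated Median (GRIM) over a collection of GRs.

    Assumption: each integer rank r is treated as uniformly spread over
    [r - 0.5, r + 0.5], and the 50th percentile of that distribution is
    returned (the standard interpolated median for grouped data)."""
    counts = {}
    for r in ranks:
        counts[r] = counts.get(r, 0) + 1
    n = len(ranks)
    cum = 0
    for r in sorted(counts):
        prev = cum
        cum += counts[r]
        if cum >= n / 2:
            # linear interpolation within the bin containing the median
            return (r - 0.5) + (n / 2 - prev) / counts[r]
```

For example, `golden_rank(["Paris", "London", "Lyon"], {"London"})` returns 1 (a near miss), and a test set whose GRs are `[0, 0, 1, 1]` has a GRIM of 0.5, reflecting that failed predictions sit close to the model's top choice.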
