Paper Title

Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining

Paper Authors

Ananya B. Sai, Akash Kumar Mohankumar, Siddhartha Arora, Mitesh M. Khapra

Paper Abstract

There is an increasing focus on model-based dialog evaluation metrics such as ADEM, RUBER, and the more recent BERT-based metrics. These models aim to assign a high score to all relevant responses and a low score to all irrelevant responses. Ideally, such models should be trained using multiple relevant and irrelevant responses for any given context. However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected responses from other contexts (random negatives). To allow for better training and robust evaluation of model-based metrics, we introduce the DailyDialog++ dataset, consisting of (i) five relevant responses for each context and (ii) five adversarially crafted irrelevant responses for each context. Using this dataset, we first show that even in the presence of multiple correct references, n-gram based metrics and embedding based metrics do not perform well at separating relevant responses from even random negatives. While model-based metrics perform better than n-gram and embedding based metrics on random negatives, their performance drops substantially when evaluated on adversarial examples. To check if large scale pretraining could help, we propose a new BERT-based evaluation metric called DEB, which is pretrained on 727M Reddit conversations and then finetuned on our dataset. DEB significantly outperforms existing models, showing better correlation with human judgements and better performance on random negatives (88.27% accuracy). However, its performance again drops substantially when evaluated on adversarial responses, thereby highlighting that even large-scale pretrained evaluation models are not robust to the adversarial examples in our dataset. The dataset and code are publicly available.
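As a rough illustration of what a model-based metric of this kind computes, the sketch below scores a (context, response) pair with BERT's next-sentence-prediction head. This is a minimal sketch, not the authors' released DEB code: the off-the-shelf bert-base-uncased checkpoint, the helper name relevance_score, and the example utterances are all assumptions for illustration, and the model here has not seen the 727M Reddit conversations or the DailyDialog++ finetuning described in the abstract.

```python
# Minimal sketch (assumed setup, not the paper's released implementation):
# score a (context, response) pair with BERT's next-sentence-prediction head.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def relevance_score(context: str, response: str) -> float:
    """Probability that `response` is a plausible continuation of `context`."""
    inputs = tokenizer(context, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 2); index 0 = "is next sentence"
    return torch.softmax(logits, dim=-1)[0, 0].item()

context = "Do you want to grab dinner tonight?"
print(relevance_score(context, "Sure, how about the new Thai place?"))   # relevant response
print(relevance_score(context, "The train to Boston leaves at 6 a.m."))  # random negative
```

Under this setup, relevant responses should score close to 1 and random negatives noticeably lower; the abstract's point is that adversarially crafted negatives in DailyDialog++ still receive deceptively high scores from such models.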
