Paper Title
Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation
Paper Authors
Paper Abstract
Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they calculate the difference between the generated response and a limited number of available references. Likert-score-based self-reported user ratings are widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported user ratings suffer from bias and variance across different users. To alleviate this problem, we formulate dialog evaluation as a comparison task. We also propose an automatic evaluation model, CMADE (Comparison Model for Automatic Dialog Evaluation), that automatically cleans self-reported user ratings as it trains on them. Specifically, we first use a self-supervised method to learn better dialog feature representations, and then use KNN and Shapley values to remove confusing samples. Our experiments show that CMADE achieves 89.2% accuracy on the dialog comparison task.
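The abstract describes a data-cleaning stage that scores training dialogs with KNN-based Shapley values and discards confusing (low-value) samples. The sketch below is only an illustration of that idea, using the closed-form KNN Shapley recursion of Jia et al. (2019); the feature matrices, the choice of Euclidean distance, and the threshold for dropping samples are assumptions for the example, not CMADE's actual implementation.

```python
import numpy as np

def knn_shapley(train_feats, train_labels, test_feats, test_labels, k=10):
    """Closed-form KNN Shapley value of each training dialog (Jia et al., 2019).

    train_feats / test_feats: (n, d) and (m, d) dialog feature matrices
    train_labels / test_labels: integer rating labels.
    Returns one value per training dialog; low or negative values flag samples
    whose self-reported rating disagrees with its neighborhood.
    """
    n = len(train_labels)
    shapley = np.zeros(n)
    for x, y in zip(test_feats, test_labels):
        # Sort training dialogs by distance to the test dialog (ascending).
        order = np.argsort(np.linalg.norm(train_feats - x, axis=1))
        s = np.zeros(n)
        # Recursion starts from the farthest training dialog.
        last = order[-1]
        s[last] = float(train_labels[last] == y) / n
        for j in range(n - 2, -1, -1):
            cur, nxt = order[j], order[j + 1]
            s[cur] = s[nxt] + (
                float(train_labels[cur] == y) - float(train_labels[nxt] == y)
            ) / k * min(k, j + 1) / (j + 1)
        shapley += s
    return shapley / len(test_labels)

# Hypothetical usage: keep only dialogs with non-negative value before
# fine-tuning the comparison model (the 0.0 cutoff is an assumption).
# values = knn_shapley(train_feats, train_labels, dev_feats, dev_labels)
# keep = values >= 0.0
```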