Paper Title
Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation
Paper Authors
Paper Abstract
Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they calculate the difference between the generated response and a limited number of available references. Likert-score-based self-reported user ratings are widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported user ratings suffer from bias and variance across different users. To alleviate this problem, we formulate dialog evaluation as a comparison task. We also propose an automatic evaluation model, CMADE (Comparison Model for Automatic Dialog Evaluation), that automatically cleans self-reported user ratings as it trains on them. Specifically, we first use a self-supervised method to learn better dialog feature representations, and then use KNN and Shapley values to remove confusing samples. Our experiments show that CMADE achieves 89.2% accuracy on the dialog comparison task.
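The abstract describes a data-cleaning stage that scores training dialogs with KNN-based Shapley values and discards confusing (low-value) samples. The sketch below is only an illustration of that idea, using the closed-form KNN Shapley recursion of Jia et al. (2019); the feature matrices, the choice of Euclidean distance, and the threshold for dropping samples are assumptions for the example, not CMADE's actual implementation.

```python
import numpy as np

def knn_shapley(train_feats, train_labels, test_feats, test_labels, k=10):
    """Closed-form KNN Shapley value of each training dialog (Jia et al., 2019).

    train_feats / test_feats: (n, d) and (m, d) dialog feature matrices
    train_labels / test_labels: integer rating labels.
    Returns one value per training dialog; low or negative values flag samples
    whose self-reported rating disagrees with its neighborhood.
    """
    n = len(train_labels)
    shapley = np.zeros(n)
    for x, y in zip(test_feats, test_labels):
        # Sort training dialogs by distance to the test dialog (ascending).
        order = np.argsort(np.linalg.norm(train_feats - x, axis=1))
        s = np.zeros(n)
        # Recursion starts from the farthest training dialog.
        last = order[-1]
        s[last] = float(train_labels[last] == y) / n
        for j in range(n - 2, -1, -1):
            cur, nxt = order[j], order[j + 1]
            s[cur] = s[nxt] + (
                float(train_labels[cur] == y) - float(train_labels[nxt] == y)
            ) / k * min(k, j + 1) / (j + 1)
        shapley += s
    return shapley / len(test_labels)

# Hypothetical usage: keep only dialogs with non-negative value before
# fine-tuning the comparison model (the 0.0 cutoff is an assumption).
# values = knn_shapley(train_feats, train_labels, dev_feats, dev_labels)
# keep = values >= 0.0
```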