Paper Title

An Evaluation Protocol for Generative Conversational Systems

Authors

Seolhwa Lee, Heuiseok Lim, João Sedoc

Abstract


There is a multitude of novel generative models for open-domain conversational systems; however, there is no systematic evaluation of different systems. Systematic comparisons require consistency in experimental design, evaluation sets, conversational systems and their outputs, and statistical analysis. We lay out a protocol for the evaluation of conversational models using head-to-head pairwise comparison. We analyze ten recent models that claim state-of-the-art performance using paired head-to-head performance (win-loss-tie) on five evaluation datasets. Our findings show that DialoGPT and Blender are superior systems according to the Bradley-Terry model and TrueSkill ranking methods. These findings demonstrate the feasibility of our protocol for evaluating conversational agents and evaluation sets. Finally, we make all code and evaluations publicly available for researchers to compare their models to other state-of-the-art dialog models.
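To illustrate the ranking step the abstract describes, below is a minimal sketch (not the authors' released code) of how pairwise win counts from head-to-head judgments can be turned into a Bradley-Terry ranking via the MM algorithm (Hunter, 2004). The win matrix is made-up toy data, and splitting ties as half a win for each system is an assumption, not a detail taken from the paper.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the MM algorithm (Hunter, 2004).

    wins[i, j] = number of times system i beat system j
                 (ties, if any, assumed pre-split as 0.5 wins to each side).
    Returns a strength vector p with sum(p) == 1; higher means stronger.
    """
    n = wins.shape[0]
    p = np.ones(n) / n
    games = wins + wins.T  # total comparisons played by each pair
    for _ in range(iters):
        for i in range(n):
            mask = np.arange(n) != i
            # MM update: p_i = (total wins of i) / sum_j games_ij / (p_i + p_j)
            denom = np.sum(games[i, mask] / (p[i] + p[mask]))
            p[i] = wins[i, mask].sum() / denom
        p /= p.sum()  # renormalize; strengths are only identifiable up to scale
    return p

# Toy example: 3 systems, wins[i, j] = wins of system i over system j.
wins = np.array([[0.0, 7.0, 9.0],
                 [3.0, 0.0, 6.0],
                 [1.0, 4.0, 0.0]])
strengths = bradley_terry(wins)
ranking = np.argsort(-strengths)  # indices of systems, best first
print(strengths, ranking)
```

The same win-loss-tie records could alternatively be fed to a TrueSkill-style rater, which updates per-system skill distributions match by match rather than fitting all comparisons jointly as Bradley-Terry does.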
