Paper Title
Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents
Paper Authors
Paper Abstract
At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv:1603.08023), and human evaluations are still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that the best method depends on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice on when to use which method, and to possible future directions.
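To make the notion of statistical sensitivity concrete, the sketch below simulates how often a pairwise-preference evaluation detects a difference between two dialogue models at a given annotation budget. This is only an illustration under assumed conditions (a fixed "true" win rate, a two-sided binomial test, hypothetical function names and numbers); it is not the paper's methodology.

```python
# Illustrative sketch (not from the paper): given pairwise preference
# annotations between two dialogue models, estimate how often a two-sided
# binomial test detects a difference from 50/50 at a given budget.
import random
from scipy.stats import binomtest


def detection_rate(true_win_rate: float, n_annotations: int,
                   n_trials: int = 1000, alpha: float = 0.05,
                   seed: int = 0) -> float:
    """Fraction of simulated evaluations in which model A's win rate over
    model B is judged significantly different from chance (p < alpha)."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_trials):
        # Simulate n_annotations crowdworker judgments, each preferring
        # model A with probability true_win_rate.
        wins = sum(rng.random() < true_win_rate for _ in range(n_annotations))
        if binomtest(wins, n_annotations, p=0.5).pvalue < alpha:
            detected += 1
    return detected / n_trials


if __name__ == "__main__":
    # Hypothetical example: a method whose judgments prefer model A 60% of
    # the time; larger annotation budgets yield higher detection rates.
    for n in (50, 100, 200, 400):
        print(n, detection_rate(true_win_rate=0.60, n_annotations=n))
```

A more sensitive evaluation method, in this framing, is one that reaches a high detection rate with fewer annotations, which is what ties statistical sensitivity to annotation hours and labor cost in the abstract above.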