Paper Title
TuringAdvice: A Generative and Dynamic Evaluation of Language Use
Paper Authors
Paper Abstract
We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today's models struggle at TuringAdvice, even multi-billion-parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.
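The headline numbers above (14% for the finetuned T5, 4% for GPT3) are win rates: the share of situations in which human judges rate the model's advice as at least as helpful as the human-written reference advice. The sketch below is purely illustrative and not the authors' released evaluation code; the Judgment structure, its fields, and the function name are assumptions made for this example.

```python
# Illustrative sketch (assumed structure, not the TuringAdvice authors' code)
# of tallying the headline metric: the percentage of situations where judges
# rate model-written advice at least as helpful as the human-written advice.

from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    situation_id: str
    model_at_least_as_helpful: bool  # judge rated model advice >= human advice


def advice_win_rate(judgments: List[Judgment]) -> float:
    """Return the percentage of situations where the model's advice was
    judged at least as helpful as the human-written reference advice."""
    if not judgments:
        return 0.0
    wins = sum(j.model_at_least_as_helpful for j in judgments)
    return 100.0 * wins / len(judgments)


# Toy usage: three judged situations, one of which the model ties or wins.
judgments = [
    Judgment("situation-1", True),
    Judgment("situation-2", False),
    Judgment("situation-3", False),
]
print(f"Model advice judged at least as helpful in "
      f"{advice_win_rate(judgments):.0f}% of cases")
```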