Paper Title


TuringAdvice: A Generative and Dynamic Evaluation of Language Use

Authors

Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali Farhadi, Yejin Choi

Abstract

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today's models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.
