Paper Title
TuringAdvice: A Generative and Dynamic Evaluation of Language Use
Paper Authors
Paper Abstract
We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today's models struggle at TuringAdvice, even multi-billion-parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.
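The headline numbers above (14% for the finetuned T5, 4% for GPT3) are win rates: the share of situations in which human judges rate the model's advice as at least as helpful as the human-written reference advice. The sketch below is purely illustrative and not the authors' released evaluation code; the Judgment structure, its fields, and the function name are assumptions made for this example.

```python
# Illustrative sketch (assumed structure, not the TuringAdvice authors' code)
# of tallying the headline metric: the percentage of situations where judges
# rate model-written advice at least as helpful as the human-written advice.

from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    situation_id: str
    model_at_least_as_helpful: bool  # judge rated model advice >= human advice


def advice_win_rate(judgments: List[Judgment]) -> float:
    """Return the percentage of situations where the model's advice was
    judged at least as helpful as the human-written reference advice."""
    if not judgments:
        return 0.0
    wins = sum(j.model_at_least_as_helpful for j in judgments)
    return 100.0 * wins / len(judgments)


# Toy usage: three judged situations, one of which the model ties or wins.
judgments = [
    Judgment("situation-1", True),
    Judgment("situation-2", False),
    Judgment("situation-3", False),
]
print(f"Model advice judged at least as helpful in "
      f"{advice_win_rate(judgments):.0f}% of cases")
```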