Paper Title
True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4
Paper Authors
Paper Abstract
Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities, as reflected in their performance on current test tasks. This calls for a more challenging benchmark that requires highly advanced reasoning abilities to solve. In this paper, we introduce such a benchmark, consisting of 191 long-form (1,200 words on average) mystery narratives constructed as detective puzzles. The puzzles are sourced from the "5 Minute Mystery" platform and each includes a multiple-choice question for evaluation. Only 47% of humans solve a puzzle successfully on average, while the best human solvers achieve over an 80% success rate. We show that GPT-3 models barely outperform random guessing on this benchmark (with 28% accuracy), while the state-of-the-art GPT-4 solves only 38% of the puzzles. This indicates that a significant gap remains between the deep reasoning abilities of LLMs and those of humans, and highlights the need for further research in this area. Our work introduces a challenging benchmark for future studies on reasoning in language models and contributes to a better understanding of the limits of LLMs' abilities.