Paper Title

ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning

Paper Authors

Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, Asli Celikyilmaz

Paper Abstract

Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end-task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits, by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostic datasets, covering a diverse set of tasks that require reasoning skills, and show that ROSCOE can consistently outperform baseline metrics.
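
For intuition, the sketch below illustrates the flavor of one embedding-based score in the spirit of ROSCOE's semantic alignment: each reasoning step is embedded and scored by its best cosine similarity to a sentence in the source context, then averaged over steps. This is a minimal illustration under stated assumptions, not the authors' implementation; the model name `all-MiniLM-L6-v2` and the helper `reasoning_alignment` are choices made here for the example, and the paper defines a full suite of scores beyond this one.

```python
# A minimal sketch of a ROSCOE-style semantic alignment score, assuming the
# sentence-transformers library is available. The model and the exact formula
# are illustrative, not the paper's specification.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def reasoning_alignment(steps: list[str], source_sentences: list[str]) -> float:
    """Score how well each reasoning step is grounded in the source context.

    For every hypothesis step, take the maximum cosine similarity against
    the source sentences, then average over steps (higher = better aligned).
    """
    step_emb = model.encode(steps, convert_to_tensor=True)
    src_emb = model.encode(source_sentences, convert_to_tensor=True)
    sims = util.cos_sim(step_emb, src_emb)       # shape: (num_steps, num_source)
    return sims.max(dim=1).values.mean().item()  # mean of per-step best matches

# Hypothetical usage on a toy arithmetic problem:
source = [
    "Tom has 3 apples.",
    "He buys 2 more apples.",
]
steps = [
    "Tom starts with 3 apples.",
    "After buying 2 more, he has 3 + 2 = 5 apples.",
]
print(f"alignment score: {reasoning_alignment(steps, source):.3f}")
```

A score like this is interpretable per step (the max-similarity source sentence shows what grounds each step) and unsupervised, which matches the two properties the abstract emphasizes.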
