Paper Title

The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning

Paper Authors

Xi Ye, Greg Durrett

Paper Abstract

Does prompting a large language model (LLM) like GPT-3 with explanations improve in-context learning? We study this question on two NLP tasks that involve reasoning over text, namely question answering and natural language inference. We test the performance of four LLMs on three textual reasoning datasets using prompts that include explanations in multiple different styles. For these tasks, we find that including explanations in the prompts for OPT, GPT-3 (davinci), and InstructGPT (text-davinci-001) only yields small to moderate accuracy improvements over standard few-shot learning. However, text-davinci-002 is able to benefit more substantially. We further show that explanations generated by the LLMs may not entail the models' predictions nor be factually grounded in the input, even on simple tasks with extractive explanations. However, these flawed explanations can still be useful as a way to verify LLMs' predictions post-hoc. Through analysis in our three settings, we show that explanations judged by humans to be good (logically consistent with the input and the prediction) are more likely to co-occur with accurate predictions. Following these observations, we train calibrators using automatically extracted scores that assess the reliability of explanations, allowing us to improve performance post-hoc across all of our datasets.
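To make the setup concrete, the sketch below illustrates the two ideas the abstract describes: a few-shot prompt in which each demonstration carries an explanation, and a post-hoc calibrator trained on automatically extracted scores that gauge explanation reliability. This is not the authors' implementation; the prompt template and the `build_prompt`, `overlap_score`, and `train_calibrator` helpers, as well as the logistic-regression calibrator and the single lexical-overlap feature, are hypothetical stand-ins assumed for illustration.

```python
# Minimal sketch (illustrative only, not the paper's pipeline):
# (1) build an explain-then-predict few-shot prompt,
# (2) score how grounded a generated explanation is in the input,
# (3) train a post-hoc calibrator on such scores.
from sklearn.linear_model import LogisticRegression


def build_prompt(demos, query):
    """Each demonstration pairs a context/question with an explanation and an answer."""
    parts = []
    for d in demos:
        parts.append(
            f"Context: {d['context']}\nQuestion: {d['question']}\n"
            f"Explanation: {d['explanation']}\nAnswer: {d['answer']}\n"
        )
    # The query gets no explanation or answer; the LLM is expected to generate both.
    parts.append(f"Context: {query['context']}\nQuestion: {query['question']}\n")
    return "\n".join(parts)


def overlap_score(explanation, context):
    """Crude proxy for factual grounding: fraction of explanation tokens
    that also appear in the input context."""
    exp_tokens = set(explanation.lower().split())
    ctx_tokens = set(context.lower().split())
    return len(exp_tokens & ctx_tokens) / max(len(exp_tokens), 1)


def train_calibrator(scores, correct):
    """Fit a calibrator on explanation-derived scores for held-out examples,
    where `correct` marks whether the LLM's prediction was right (1) or not (0)."""
    X = [[s] for s in scores]  # one feature per example here; more in practice
    calibrator = LogisticRegression()
    calibrator.fit(X, correct)
    return calibrator
```

In use, such a calibrator would score explanations generated at inference time, and its predicted probability could be used to accept, rerank, or abstain from the LLM's prediction post-hoc.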
