Paper Title

Language Models (Mostly) Know What They Know

Paper Authors

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, Jared Kaplan

Paper Abstract

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
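The P(True) self-evaluation procedure described in the abstract (propose an answer, optionally show the model several of its own samples, then ask it whether the proposed answer is correct and read off the probability assigned to "True") can be illustrated with a minimal sketch. The helpers `sample_answer` and `true_false_logprobs` below are hypothetical stand-ins for a language model API, and the prompt wording is an illustrative assumption rather than the paper's verbatim format.

```python
# Minimal sketch of the P(True) self-evaluation loop described in the abstract.
# `sample_answer` and `true_false_logprobs` are hypothetical placeholders for a
# language model API; they are not part of any specific library.

from math import exp
from typing import Callable, Dict


def p_true(
    question: str,
    sample_answer: Callable[[str], str],            # hypothetical: draws one answer from the model
    true_false_logprobs: Callable[[str], Dict[str, float]],  # hypothetical: log-probs of "True"/"False"
    n_comparison_samples: int = 5,
) -> float:
    """Estimate P(True): the model's own probability that its proposed answer is correct."""
    # 1. Ask the model to propose an answer to the open-ended question.
    proposed = sample_answer(f"Question: {question}\nAnswer:")

    # 2. Optionally draw several more of the model's own samples for comparison;
    #    the abstract reports that seeing many samples improves self-evaluation.
    comparisons = [
        sample_answer(f"Question: {question}\nAnswer:") for _ in range(n_comparison_samples)
    ]

    # 3. Ask the model whether the proposed answer is correct, and read off the
    #    probability it assigns to "True" versus "False".
    prompt = (
        f"Question: {question}\n"
        + "Here are some brainstormed ideas:\n"
        + "\n".join(comparisons) + "\n"
        + f"Proposed Answer: {proposed}\n"
        + "Is the proposed answer correct? (A) True (B) False\n"
        + "The proposed answer is:"
    )
    logps = true_false_logprobs(prompt)
    p_t, p_f = exp(logps["True"]), exp(logps["False"])
    return p_t / (p_t + p_f)  # renormalize over the two options
```

The trained P(IK) predictor mentioned in the abstract is different in kind: it estimates whether the model knows the answer to a question before any specific answer is proposed, so it is not covered by this sketch.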
