Paper Title
Impact of Pretraining Term Frequencies on Few-Shot Reasoning
Paper Authors
Paper Abstract
Pretrained Language Models (LMs) have demonstrated the ability to perform numerical reasoning by extrapolating from a few examples in few-shot settings. However, the extent to which this extrapolation relies on robust reasoning is unclear. In this paper, we investigate how well these models reason with terms that are less frequent in the pretraining data. In particular, we examine the correlations between the model performance on test instances and the frequency of terms from those instances in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models (pretrained on the Pile dataset) on various numerical deduction tasks (e.g., arithmetic and unit conversion). Our results consistently demonstrate that models are more accurate on instances whose terms are more prevalent, in some cases above $70\%$ (absolute) more accurate on instances with the top 10\% most frequent terms than on those with the bottom 10\%. Overall, although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond the pretraining data, and we encourage researchers to take the pretraining data into account when interpreting evaluation results.
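The abstract's headline comparison (accuracy on instances with the most frequent terms versus the least frequent) can be illustrated with a small sketch. The snippet below is not the authors' released code; the `Record` structure, the `top_bottom_gap` helper, and the toy data are hypothetical stand-ins for real per-instance (pretraining term frequency, correctness) pairs derived from counting term occurrences in the Pile and scoring few-shot model outputs.

```python
# Minimal sketch: relate per-instance accuracy to pretraining term frequency
# by comparing the top and bottom 10% of instances ranked by frequency.

from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    term_frequency: int   # how often the instance's key term appears in the pretraining data
    correct: bool         # whether the LM answered this test instance correctly


def accuracy(records: List[Record]) -> float:
    """Fraction of records answered correctly."""
    return sum(r.correct for r in records) / len(records)


def top_bottom_gap(records: List[Record], fraction: float = 0.10) -> float:
    """Accuracy difference between the most and least frequent `fraction` of instances."""
    ranked = sorted(records, key=lambda r: r.term_frequency)
    k = max(1, int(len(ranked) * fraction))
    bottom, top = ranked[:k], ranked[-k:]
    return accuracy(top) - accuracy(bottom)


if __name__ == "__main__":
    # Toy data standing in for real (frequency, correctness) pairs.
    toy = [Record(10, False), Record(50, False), Record(200, True),
           Record(1_000, True), Record(5_000, True), Record(20_000, True),
           Record(30, False), Record(400, True), Record(80, False), Record(9_000, True)]
    print(f"Top-vs-bottom 10% accuracy gap: {top_bottom_gap(toy):+.2f}")
```

A large positive gap under this kind of analysis is what the abstract reports: the model is substantially more accurate on instances whose terms it saw often during pretraining.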