Paper Title

Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?

Paper Authors

Byung-Doh Oh, William Schuler

Paper Abstract

This work presents a detailed linguistic analysis into why larger Transformer-based pre-trained language models with more parameters and lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times. First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for the more recently released five GPT-Neo variants and eight OPT variants on two separate datasets, replicating earlier results limited to just GPT-2 (Oh et al., 2022). Subsequently, analysis of residual errors reveals a systematic deviation of the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words such as modals and conjunctions. These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution in using pre-trained language models to study human language processing.
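
The central quantity in the abstract is surprisal: the negative log probability of a word given its preceding context. The snippet below is a minimal sketch of how per-token surprisal can be obtained from one of the GPT-Neo variants via HuggingFace Transformers. The model name, example sentence, and bits-based scaling are illustrative assumptions, not the authors' exact pipeline; the word-level alignment and reading-time regressions described in the abstract are not shown.

```python
# Minimal sketch: per-token surprisal from a pre-trained causal LM.
# Model choice and example text are illustrative, not from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/gpt-neo-125M"  # one of the GPT-Neo variants
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The old man the boat."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape: (1, seq_len, vocab_size)

# Surprisal of token t is -log2 P(w_t | w_<t); shift logits so each
# token is predicted from its preceding context only.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = ids[:, 1:]
nats = -log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
surprisal_bits = nats / torch.log(torch.tensor(2.0))  # convert nats to bits

for tok, s in zip(tokenizer.convert_ids_to_tokens(targets[0].tolist()),
                  surprisal_bits[0]):
    print(f"{tok:>12s}  {s.item():.2f} bits")
```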
