Paper title
Testing the limits of natural language models for predicting human language judgments
Paper authors
Paper abstract
Neural network language models can serve as computational hypotheses about how humans process language. We compared the model-human consistency of diverse language models using a novel experimental approach: controversial sentence pairs. For each controversial sentence pair, two language models disagree about which sentence is more likely to occur in natural text. Considering nine language models (including n-gram, recurrent neural network, and transformer models), we created hundreds of controversial sentence pairs, either by selecting sentences from a corpus or by synthetically optimizing sentence pairs to be highly controversial. Human subjects then provided judgments indicating which sentence in each pair is more likely. Controversial sentence pairs proved highly effective at revealing model failures and identifying the models that aligned most closely with human judgments. The most human-consistent model tested was GPT-2, although the experiments also revealed significant shortcomings in its alignment with human perception.
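The corpus-based selection criterion can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `LogProbFn` interface is a hypothetical stand-in for any model that assigns sentences a log probability, and scoring a pair by the weaker of the two models' opposing preference margins is one plausible way to formalize "highly controversial"; the paper's exact objective may differ.

```python
# Minimal sketch: selecting controversial sentence pairs from a corpus.
# Assumes each language model exposes a sentence log-probability function
# (hypothetical interface, not the paper's code).

from itertools import combinations
from typing import Callable, List, Tuple

LogProbFn = Callable[[str], float]  # sentence -> log p(sentence)

def controversiality(s1: str, s2: str,
                     log_p_a: LogProbFn, log_p_b: LogProbFn) -> float:
    """Score how strongly the two models disagree on the ranking of (s1, s2).

    Positive only when model A prefers s1 while model B prefers s2;
    the score is the weaker of the two preference margins.
    """
    margin_a = log_p_a(s1) - log_p_a(s2)  # model A's preference for s1
    margin_b = log_p_b(s2) - log_p_b(s1)  # model B's preference for s2
    return min(margin_a, margin_b)

def select_controversial_pairs(sentences: List[str],
                               log_p_a: LogProbFn, log_p_b: LogProbFn,
                               top_k: int = 10) -> List[Tuple[str, str]]:
    """Rank all candidate pairs by controversiality, trying both orderings."""
    scored = []
    for s1, s2 in combinations(sentences, 2):
        score = max(controversiality(s1, s2, log_p_a, log_p_b),
                    controversiality(s2, s1, log_p_a, log_p_b))
        scored.append((score, (s1, s2)))
    scored.sort(key=lambda item: item[0], reverse=True)
    # Keep only pairs on which the models genuinely disagree (score > 0).
    return [pair for score, pair in scored[:top_k] if score > 0]
```

In use, `log_p_a` might wrap a transformer's sentence log-likelihood and `log_p_b` an n-gram model's, matching the model classes the abstract names. Taking the minimum of the two margins ensures a pair only scores highly when both models hold a confident, opposite preference, which is what makes the resulting human judgments diagnostic of model failures.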