Paper Title
Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models
Paper Authors
Paper Abstract
Recent works show that pre-trained language models (PTLMs), such as BERT, possess certain commonsense and factual knowledge. They suggest that it is promising to use PTLMs as "neural knowledge bases" via predicting masked words. Surprisingly, we find that this may not work for numerical commonsense knowledge (e.g., a bird usually has two legs). In this paper, we investigate whether and to what extent we can induce numerical commonsense knowledge from PTLMs as well as the robustness of this process. To study this, we introduce a novel probing task with a diagnostic dataset, NumerSense, containing 13.6k masked-word-prediction probes (10.5k for fine-tuning and 3.1k for testing). Our analysis reveals that: (1) BERT and its stronger variant RoBERTa perform poorly on the diagnostic dataset prior to any fine-tuning; (2) fine-tuning with distant supervision brings some improvement; (3) the best supervised model still performs poorly as compared to human performance (54.06% vs 96.3% in accuracy).
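The probing setup described in the abstract amounts to masking a number word and checking whether the model ranks the correct one (e.g., "two") highest. Below is a minimal sketch of such a masked-word probe using the HuggingFace Transformers fill-mask pipeline, with "bert-base-uncased" chosen here only as an example checkpoint; it illustrates the general setup, not the authors' exact evaluation code.

```python
# Minimal sketch of a NumerSense-style masked-word probe.
# Assumption: "bert-base-uncased" is used as an example checkpoint;
# any masked language model with a [MASK]-style token would work similarly.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model should ideally rank the correct number word ("two")
# highest among its predictions for the masked position.
probe = "a bird usually has [MASK] legs."

for prediction in fill_mask(probe, top_k=5):
    # Each prediction contains the predicted token and its probability score.
    print(f"{prediction['token_str']:>10s}  {prediction['score']:.4f}")
```

The paper's title alludes to the failure mode this kind of probe exposes: a pre-trained model may rank a wrong number word such as "four" above the correct "two".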