Paper Title
Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models
Paper Authors
Paper Abstract
English pretrained language models, which make up the backbone of many modern NLP systems, require huge amounts of unlabeled training data. These models are generally presented as being trained only on English text but have been found to transfer surprisingly well to other languages. We investigate this phenomenon and find that common English pretraining corpora actually contain significant amounts of non-English text: even when less than 1% of data is not English (well within the error rate of strong language classifiers), this leads to hundreds of millions of foreign language tokens in large-scale datasets. We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them, with target language performance strongly correlated to the amount of in-language data seen during pretraining. In light of these findings, we argue that no model is truly monolingual when pretrained at scale, which should be considered when evaluating cross-lingual transfer.
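To make the scale of the claim concrete, the sketch below (not from the paper) shows one way to estimate the non-English token count of a nominally English corpus using fastText's public language-identification model (lid.176.bin). The file paths and the 30B-token corpus size used for extrapolation are hypothetical placeholders; the crude whitespace tokenization is an assumption as well.

```python
# Minimal sketch (not the paper's method): estimate how many non-English tokens
# a "monolingual" English corpus contains, using fastText's language ID model.
import fasttext

LID_MODEL_PATH = "lid.176.bin"            # downloadable fastText LID model
CORPUS_SAMPLE_PATH = "corpus_sample.txt"  # one document per line (hypothetical)
TOTAL_CORPUS_TOKENS = 30_000_000_000      # assumed full-corpus size in tokens

model = fasttext.load_model(LID_MODEL_PATH)

non_english_tokens = 0
total_tokens = 0
with open(CORPUS_SAMPLE_PATH, encoding="utf-8") as f:
    for line in f:
        text = line.strip()
        if not text:
            continue
        n_tokens = len(text.split())      # crude whitespace tokenization
        total_tokens += n_tokens
        labels, _ = model.predict(text)   # e.g. ('__label__en',)
        if labels[0] != "__label__en":
            non_english_tokens += n_tokens

fraction = non_english_tokens / max(total_tokens, 1)
print(f"Estimated non-English fraction in sample: {fraction:.4%}")
print(f"Extrapolated non-English tokens: {fraction * TOTAL_CORPUS_TOKENS:,.0f}")
```

Under these assumptions, even a sub-1% non-English fraction extrapolates to hundreds of millions of tokens (1% of 30B is 300M), which is the order of magnitude the abstract describes.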