Paper Title
On the Language Neutrality of Pre-trained Multilingual Representations
Paper Authors
Paper Abstract
Multilingual contextual embeddings, such as multilingual BERT and XLM-RoBERTa, have proved useful for many multilingual tasks. Previous work probed the cross-linguality of the representations indirectly using zero-shot transfer learning on morphological and syntactic tasks. We instead investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics. Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings, which are explicitly trained for language neutrality. Contextual embeddings are still only moderately language-neutral by default, so we propose two simple methods for achieving stronger language neutrality: first, by unsupervised centering of the representation for each language and, second, by fitting an explicit projection on small parallel data. In addition, we show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences without using parallel data.
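To make the two proposed adjustments concrete, the sketch below illustrates the general ideas as described in the abstract: per-language mean centering of contextual embeddings and a linear projection fitted on a small set of parallel sentence pairs. This is a minimal illustration, not the authors' released code; the function names, the use of mean-pooled sentence vectors, and the least-squares solver are assumptions made here for clarity.

```python
# Minimal sketch (assumptions, not the paper's implementation) of:
# (1) unsupervised per-language centering of contextual embeddings, and
# (2) an explicit linear projection fitted on a small parallel data set.
import numpy as np


def center_by_language(embeddings: np.ndarray, languages: np.ndarray) -> np.ndarray:
    """Subtract each language's mean vector from its embeddings (unsupervised centering)."""
    centered = embeddings.copy()
    for lang in np.unique(languages):
        mask = languages == lang
        centered[mask] -= embeddings[mask].mean(axis=0)
    return centered


def fit_projection(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Fit a least-squares linear map W so that source @ W approximates target,
    using embeddings of parallel (translated) sentences."""
    W, *_ = np.linalg.lstsq(source, target, rcond=None)
    return W


# Hypothetical toy usage: 4 mean-pooled sentence vectors in 2 languages, 8 dimensions.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
langs = np.array(["en", "en", "de", "de"])

emb_centered = center_by_language(emb, langs)

# Treat the de/en rows as parallel pairs and project German into the English space.
W = fit_projection(emb_centered[langs == "de"], emb_centered[langs == "en"])
projected_de = emb_centered[langs == "de"] @ W
```

Centering needs no supervision at all (only knowing which language each sentence is in), while the projection requires a small amount of parallel data, which matches the distinction the abstract draws between the two methods.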