Paper Title
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts
Paper Authors
Paper Abstract
Massively multilingual language models such as multilingual BERT offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. However, due to limited capacity and large differences in pretraining data sizes, there is a profound performance gap between resource-rich and resource-poor target languages. The ultimate challenge is dealing with under-resourced languages not covered at all by the models and written in scripts unseen during pretraining. In this work, we propose a series of novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts. Relying on matrix factorization, our methods capitalize on the existing latent knowledge about multiple languages already available in the pretrained model's embedding matrix. Furthermore, we show that learning of the new dedicated embedding matrix in the target language can be improved by leveraging a small number of vocabulary items (i.e., the so-called lexically overlapping tokens) shared between mBERT's and target language vocabulary. Our adaptation techniques offer substantial performance gains for languages with unseen scripts. We also demonstrate that they can yield improvements for low-resource languages written in scripts covered by the pretrained model.
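To make the two core ideas in the abstract concrete, below is a minimal sketch (not the authors' implementation) of how matrix factorization plus lexically overlapping tokens can initialize a new target-language embedding matrix. It assumes NumPy, and all shapes, the SVD rank, and the toy overlap map are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch: factorize a pretrained embedding matrix and initialize a new
# target-language embedding matrix, copying rows for lexically overlapping
# tokens. Shapes, rank, and the overlap map are toy values for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Pretrained (mBERT-like) embedding matrix: |V_source| x hidden_dim.
V_src, hidden_dim, rank = 5000, 768, 100
E_src = rng.normal(size=(V_src, hidden_dim)).astype(np.float32)

# Matrix factorization: E_src ~= F_src @ G, where G (rank x hidden_dim) holds
# latent knowledge shared across languages and F_src holds per-token coordinates.
U, S, Vt = np.linalg.svd(E_src, full_matrices=False)
F_src = U[:, :rank] * S[:rank]   # |V_source| x rank
G = Vt[:rank, :]                 # rank x hidden_dim, reused for the new language

# For a new target-language vocabulary, only a small coordinate matrix F_tgt
# needs to be learned; G is kept from the pretrained model.
V_tgt = 2000
F_tgt = rng.normal(scale=0.02, size=(V_tgt, rank)).astype(np.float32)

# Lexically overlapping tokens: copy their coordinates from the source side so
# shared vocabulary items anchor the new embedding space.
# (Hypothetical overlap map: target token id -> source token id.)
overlap = {0: 10, 1: 42, 2: 1337}
for tgt_id, src_id in overlap.items():
    F_tgt[tgt_id] = F_src[src_id]

# Reconstructed target-language embedding matrix: |V_target| x hidden_dim.
E_tgt = F_tgt @ G
print(E_tgt.shape)  # (2000, 768)
```

In this reading, adaptation is data-efficient because the shared factor G is inherited from the pretrained model and only the small per-token coordinates for the new vocabulary need to be estimated, with overlapping tokens providing anchor points in the shared space.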