Paper Title
Extending Multilingual BERT to Low-Resource Languages
Paper Authors
Paper Abstract
Multilingual BERT (M-BERT) has been a huge success in both supervised and zero-shot cross-lingual transfer learning. However, this success has focused only on the top 104 languages in Wikipedia that it was trained on. In this paper, we propose a simple but effective approach to extend M-BERT (E-BERT) so that it can benefit any new language, and show that our approach benefits languages that are already in M-BERT as well. We perform an extensive set of experiments with Named Entity Recognition (NER) on 27 languages, only 16 of which are in M-BERT, and show an average F1 increase of about 6% on languages that are already in M-BERT and about 23% on new languages.
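The abstract does not spell out how M-BERT is extended to a new language. As a hedged illustration only, the sketch below shows one common way to expose a pretrained multilingual model to an unseen language: add frequent target-language subwords to the tokenizer and resize the embedding matrix before continued (masked-language-model) pretraining and downstream NER fine-tuning. This is not necessarily the authors' E-BERT recipe; the model name, the HuggingFace `transformers` API usage, and the example tokens are assumptions for the sake of illustration.

```python
# Minimal sketch (assumed setup, not the paper's E-BERT procedure):
# extend a multilingual BERT vocabulary with new target-language subwords
# and resize the embeddings so they can be trained on new-language text.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Hypothetical new subword units mined from a low-resource-language corpus.
new_tokens = ["ሰላም", "new_subword_placeholder"]  # purely illustrative
num_added = tokenizer.add_tokens(new_tokens)

# Grow the input embeddings so the new token ids get (randomly initialized)
# vectors; in practice these would be trained with continued MLM pretraining
# on target-language text before fine-tuning the model for NER.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size = {len(tokenizer)}")
```

Under this kind of setup, zero-shot transfer would then amount to fine-tuning the extended model on NER data in a source language and evaluating it directly on the target language, which is consistent with, but not confirmed by, the experimental description in the abstract.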