Paper Title

Identifying Necessary Elements for BERT's Multilinguality

Paper Authors

Philipp Dufter, Hinrich Schütze

Paper Abstract

It has been shown that multilingual BERT (mBERT) yields high-quality multilingual representations and enables effective zero-shot transfer. This is surprising given that mBERT does not use any cross-lingual signal during training. While recent literature has studied this phenomenon, the reasons for the multilinguality remain somewhat obscure. We aim to identify architectural properties of BERT and linguistic properties of languages that are necessary for BERT to become multilingual. To allow for fast experimentation, we propose an efficient setup with small BERT models trained on a mix of synthetic and natural data. Overall, we identify four architectural and two linguistic elements that influence multilinguality. Based on our insights, we experiment with a multilingual pretraining setup that modifies the masking strategy using VecMap, i.e., unsupervised embedding alignment. Experiments on XNLI with three languages indicate that our findings transfer from our small setup to larger-scale settings.
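The efficient setup mentioned in the abstract relies on deliberately small BERT models. As a rough illustration only, the sketch below builds a tiny masked-language-model configuration with the Hugging Face transformers library; the library choice and every hyperparameter value are assumptions for illustration, not the authors' published configuration.

```python
from transformers import BertConfig, BertForMaskedLM

# Hypothetical "small BERT" configuration; all sizes are illustrative
# assumptions, not the paper's reported hyperparameters.
config = BertConfig(
    vocab_size=2048,              # small vocabulary over synthetic + natural text
    hidden_size=64,               # vs. 768 in BERT-base
    num_hidden_layers=1,          # vs. 12 in BERT-base
    num_attention_heads=1,        # vs. 12 in BERT-base
    intermediate_size=256,        # vs. 3072 in BERT-base
    max_position_embeddings=128,  # short training sequences
)

model = BertForMaskedLM(config)
print(f"trainable parameters: {model.num_parameters():,}")
```

A model this size trains orders of magnitude faster than BERT-base, which is what makes the kind of ablation sweep over architectural and linguistic factors described in the abstract practical.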
