Title

Cross-lingual Retrieval for Iterative Self-Supervised Training

Authors

Chau Tran, Yuqing Tang, Xian Li, Jiatao Gu

Abstract


Recent studies have demonstrated the cross-lingual alignment ability of multilingual pretrained language models. In this work, we found that the cross-lingual alignment can be further improved by training seq2seq models on sentence pairs mined using their own encoder outputs. We utilized these findings to develop a new approach -- cross-lingual retrieval for iterative self-supervised training (CRISS), where mining and training processes are applied iteratively, improving cross-lingual alignment and translation ability at the same time. Using this method, we achieved state-of-the-art unsupervised machine translation results on 9 language directions with an average improvement of 2.4 BLEU, and on the Tatoeba sentence retrieval task in the XTREME benchmark on 16 languages with an average improvement of 21.5% in absolute accuracy. Furthermore, CRISS also brings an additional 1.8 BLEU improvement on average compared to mBART, when finetuned on supervised machine translation downstream tasks.
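The retrieval step at the heart of CRISS scores candidate sentence pairs by comparing encoder embeddings across languages. The sketch below illustrates one common way to do this, margin-based nearest-neighbour scoring over cosine similarities; the function name, the simple argmax candidate selection, and the toy inputs are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def mine_pairs(src_emb, tgt_emb, k=2, threshold=1.0):
    """Mine candidate sentence pairs with margin-based scoring of
    cosine similarities between encoder embeddings.

    Illustrative sketch only: names and candidate selection are
    assumptions, not CRISS's actual mining code.
    """
    # Normalize so dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # (n_src, n_tgt) cosine-similarity matrix

    # Mean similarity to each sentence's k nearest neighbours,
    # used to normalize the raw score (the "margin" criterion).
    nn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per source sentence
    nn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per target sentence

    pairs = []
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))  # best target for source i
        margin = sim[i, j] / ((nn_src[i] + nn_tgt[j]) / 2)
        if margin >= threshold:
            pairs.append((i, j, float(margin)))
    return pairs
```

In the full CRISS loop, pairs mined this way would be fed back as training data for the seq2seq model, whose improved encoder then yields better embeddings for the next mining round.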
