Title

Cross-lingual Retrieval for Iterative Self-Supervised Training

Authors

Chau Tran, Yuqing Tang, Xian Li, Jiatao Gu

Abstract


Recent studies have demonstrated the cross-lingual alignment ability of multilingual pretrained language models. In this work, we found that the cross-lingual alignment can be further improved by training seq2seq models on sentence pairs mined using their own encoder outputs. We utilized these findings to develop a new approach -- cross-lingual retrieval for iterative self-supervised training (CRISS), where mining and training processes are applied iteratively, improving cross-lingual alignment and translation ability at the same time. Using this method, we achieved state-of-the-art unsupervised machine translation results on 9 language directions with an average improvement of 2.4 BLEU, and on the Tatoeba sentence retrieval task in the XTREME benchmark on 16 languages with an average improvement of 21.5% in absolute accuracy. Furthermore, CRISS also brings an additional 1.8 BLEU improvement on average compared to mBART, when finetuned on supervised machine translation downstream tasks.
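The retrieval step at the heart of CRISS scores candidate sentence pairs by comparing encoder embeddings across languages. The sketch below illustrates one common way to do this, margin-based nearest-neighbour scoring over cosine similarities; the function name, the simple argmax candidate selection, and the toy inputs are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def mine_pairs(src_emb, tgt_emb, k=2, threshold=1.0):
    """Mine candidate sentence pairs with margin-based scoring of
    cosine similarities between encoder embeddings.

    Illustrative sketch only: names and candidate selection are
    assumptions, not CRISS's actual mining code.
    """
    # Normalize so dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # (n_src, n_tgt) cosine-similarity matrix

    # Mean similarity to each sentence's k nearest neighbours,
    # used to normalize the raw score (the "margin" criterion).
    nn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per source sentence
    nn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per target sentence

    pairs = []
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))  # best target for source i
        margin = sim[i, j] / ((nn_src[i] + nn_tgt[j]) / 2)
        if margin >= threshold:
            pairs.append((i, j, float(margin)))
    return pairs
```

In the full CRISS loop, pairs mined this way would be fed back as training data for the seq2seq model, whose improved encoder then yields better embeddings for the next mining round.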
