Paper Title
JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation
Paper Authors
Paper Abstract
Neural machine translation (NMT) needs large parallel corpora for state-of-the-art translation quality. Low-resource NMT is typically addressed by transfer learning, which leverages large monolingual or parallel corpora for pre-training. Monolingual pre-training approaches such as MASS (MAsked Sequence to Sequence) are extremely effective in boosting NMT quality for languages with small parallel corpora. However, they do not account for linguistic information obtained using syntactic analyzers, which is known to be invaluable for several Natural Language Processing (NLP) tasks. To this end, we propose JASS, Japanese-specific Sequence to Sequence, as a novel pre-training alternative to MASS for NMT involving Japanese as the source or target language. JASS is joint BMASS (Bunsetsu MASS) and BRSS (Bunsetsu Reordering Sequence to Sequence) pre-training, which focuses on Japanese linguistic units called bunsetsus. In our experiments on ASPEC Japanese--English and News Commentary Japanese--Russian translation, we show that JASS can give results that are competitive with, if not better than, those given by MASS. Furthermore, we show for the first time that joint MASS and JASS pre-training gives results that significantly surpass the individual methods, indicating their complementary nature. We will release our code, pre-trained models and bunsetsu-annotated data as resources for researchers to use in their own NLP tasks.
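To make the two objectives concrete, below is a minimal Python sketch of how bunsetsu-level pre-training pairs might be generated. It assumes the input sentence is already segmented into bunsetsus (in practice a Japanese syntactic analyzer would produce this segmentation); the function names, the mask token, the 50% mask ratio, and the random permutation standing in for the paper's actual reordering scheme are all illustrative assumptions, not the authors' implementation.

```python
import random

MASK = "[MASK]"  # placeholder mask token; the real vocabulary symbol may differ

def bmass_pair(bunsetsus, mask_ratio=0.5, seed=None):
    """Illustrative BMASS-style pair: mask a contiguous block of WHOLE
    bunsetsu units (never splitting inside one) and ask the decoder to
    reconstruct the masked span. Ratio and span choice are assumptions."""
    rng = random.Random(seed)
    n = len(bunsetsus)
    span = max(1, int(n * mask_ratio))
    start = rng.randint(0, n - span)
    masked = bunsetsus[start:start + span]
    source = bunsetsus[:start] + [MASK] * span + bunsetsus[start + span:]
    return " ".join(source), " ".join(masked)

def brss_pair(bunsetsus, seed=None):
    """Illustrative BRSS-style pair: the source is a permuted bunsetsu
    sequence, the target restores the original order. A random shuffle
    stands in here for the paper's reordering rule."""
    rng = random.Random(seed)
    shuffled = bunsetsus[:]
    rng.shuffle(shuffled)
    return " ".join(shuffled), " ".join(bunsetsus)

# Toy usage with a pre-segmented sentence ("He bought a new book"):
bunsetsus = ["彼は", "新しい", "本を", "買った"]
print(bmass_pair(bunsetsus, seed=0))
print(brss_pair(bunsetsus, seed=0))
```

The point of operating on whole bunsetsus, rather than arbitrary subword spans as in generic MASS, is that the masking and reordering boundaries themselves carry the syntactic information that purely monolingual pre-training otherwise ignores.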