Paper Title
Complete Multilingual Neural Machine Translation
Paper Authors
Paper Abstract
Multilingual Neural Machine Translation (MNMT) models are commonly trained on a joint set of bilingual corpora which is acutely English-centric (i.e. English either as the source or target language). While direct data between two non-English languages is explicitly available at times, its use is not common. In this paper, we first take a step back and look at the commonly used bilingual corpora (WMT), and resurface the existence and importance of an implicit structure that exists in them: multi-way alignment across examples (the same sentence in more than two languages). We set out to study the use of multi-way aligned examples to enrich the original English-centric parallel corpora. We reintroduce this direct parallel data from multi-way aligned corpora between all source and target languages. By doing so, the English-centric graph expands into a complete graph, with every language pair connected. We call MNMT with such a connectivity pattern complete Multilingual Neural Machine Translation (cMNMT) and demonstrate its utility and efficacy with a series of experiments and analyses. In combination with a novel training data sampling strategy that is conditioned on the target language only, cMNMT yields competitive translation quality for all language pairs. We further study the size effect of multi-way aligned data, its transfer learning capabilities, and how it eases adding a new language in MNMT. Finally, we stress test cMNMT at scale and demonstrate that we can train a cMNMT model with up to 111 × 112 = 12,432 language pairs that provides competitive translation quality for all language pairs.
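The abstract describes two ideas: pivoting on multi-way aligned sentences to expand an English-centric corpus into a complete graph of language pairs, and sampling training data conditioned only on the target language. The following Python sketch illustrates both at a toy scale; the miniature corpora, function names, and the temperature-based sampling formula are assumptions made for illustration and are not the paper's actual data pipeline or sampling strategy.

```python
from collections import defaultdict

# Hypothetical English-centric corpora: each maps an English sentence
# to its translation in one non-English language.
en_de = {"Hello world.": "Hallo Welt.", "Good morning.": "Guten Morgen."}
en_fr = {"Hello world.": "Bonjour le monde.", "Thank you.": "Merci."}
en_cs = {"Hello world.": "Ahoj svete.", "Good morning.": "Dobre rano."}
corpora = {"de": en_de, "fr": en_fr, "cs": en_cs}

def multiway_align(corpora):
    """Group translations of the same English sentence across languages."""
    aligned = defaultdict(dict)
    for lang, pairs in corpora.items():
        for en_sent, translation in pairs.items():
            aligned[en_sent][lang] = translation
    return aligned

def complete_graph_pairs(aligned):
    """Emit direct parallel examples for every language pair, including
    non-English directions, turning the English-centric star graph into
    a complete graph."""
    examples = []
    for en_sent, translations in aligned.items():
        sents = dict(translations, en=en_sent)
        for src in sorted(sents):
            for tgt in sorted(sents):
                if src != tgt:
                    examples.append((src, tgt, sents[src], sents[tgt]))
    return examples

def target_conditioned_probs(examples, temperature=5.0):
    """Assumed sketch of sampling conditioned on the target language only:
    temperature-scaled example counts per target language, normalized to
    probabilities (an illustrative choice, not the paper's exact method)."""
    counts = defaultdict(int)
    for _, tgt, _, _ in examples:
        counts[tgt] += 1
    scaled = {t: c ** (1.0 / temperature) for t, c in counts.items()}
    total = sum(scaled.values())
    return {t: v / total for t, v in scaled.items()}

if __name__ == "__main__":
    examples = complete_graph_pairs(multiway_align(corpora))
    for src, tgt, s, t in examples:
        print(f"{src}->{tgt}: {s} ||| {t}")
    print(target_conditioned_probs(examples))
```

In this toy example, "Hello world." is available in four languages, so it alone yields 4 × 3 = 12 directed examples, including direct non-English pairs such as de->fr that never appear in the original English-centric corpora.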