Paper Title

Deep Transformers with Latent Depth

Paper Authors

Xian Li, Asa Cooper Stickland, Yuqing Tang, Xiang Kong

Paper Abstract

The Transformer model has achieved state-of-the-art performance in many sequence modeling tasks. However, how to leverage model capacity with large or variable depths is still an open challenge. We present a probabilistic framework to automatically learn which layer(s) to use by learning the posterior distributions of layer selection. As an extension of this framework, we propose a novel method to train one shared Transformer network for multilingual machine translation with different layer selection posteriors for each language pair. The proposed method alleviates the vanishing gradient issue and enables stable training of deep Transformers (e.g. 100 layers). We evaluate on WMT English-German machine translation and masked language modeling tasks, where our method outperforms existing approaches for training deeper Transformers. Experiments on multilingual machine translation demonstrate that this approach can effectively leverage increased model capacity and bring universal improvement for both many-to-one and one-to-many translation with diverse language pairs.
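
The abstract describes learning a posterior distribution over which layers to use, so that very deep stacks can be trained stably and, in the multilingual extension, each language pair can select its own subset of shared layers. Below is a minimal sketch of the single-task idea, assuming a Gumbel-sigmoid relaxation of per-layer Bernoulli selection gates and a residual skip when a layer is not selected; the class and parameter names (LatentDepthEncoder, select_logits, temperature) are illustrative and not taken from the paper, whose exact parameterization may differ.

```python
# Minimal sketch: a Transformer encoder stack with latent per-layer selection gates.
# Assumption (not from the paper text): gates are relaxed Bernoulli variables
# sampled via Gumbel-sigmoid noise; a skipped layer falls back to the residual input.
import torch
import torch.nn as nn

class LatentDepthEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=24, temperature=1.0):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_layers)]
        )
        # One logit per layer parameterizes q(z_l), the layer-selection posterior.
        self.select_logits = nn.Parameter(torch.zeros(num_layers))
        self.temperature = temperature

    def sample_gates(self):
        # Gumbel-sigmoid: a differentiable relaxation of Bernoulli layer selection.
        u = torch.rand_like(self.select_logits).clamp(1e-6, 1 - 1e-6)
        noise = torch.log(u) - torch.log1p(-u)  # logistic noise
        return torch.sigmoid((self.select_logits + noise) / self.temperature)

    def forward(self, x):
        # Sample gates during training; use the posterior mean at inference
        # (a simplification of whatever decision rule the paper applies).
        gates = self.sample_gates() if self.training else torch.sigmoid(self.select_logits)
        for layer, z in zip(self.layers, gates):
            # Soft layer selection: interpolate between applying and skipping the layer.
            x = z * layer(x) + (1.0 - z) * x
        return x

# Usage: encode a batch of 2 sequences of length 10.
enc = LatentDepthEncoder(num_layers=12)
out = enc(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

For the multilingual extension described above, one would hold a separate set of selection logits per language pair while sharing the layer parameters themselves; the sketch shows only the shared single-task case.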
