Paper Title

Rewiring the Transformer with Depth-Wise LSTMs

Paper Authors

Hongfei Xu, Yang Song, Qiuhui Liu, Josef van Genabith, Deyi Xiong

Paper Abstract

Stacking non-linear layers allows deep neural networks to model complicated functions, and including residual connections in Transformer layers is beneficial for convergence and performance. However, residual connections may make the model "forget" distant layers and fail to fuse information from previous layers effectively. Selectively managing the representation aggregation of Transformer layers may lead to better performance. In this paper, we present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers. We show that layer normalization and feed-forward computation within a Transformer layer can be absorbed into depth-wise LSTMs connecting pure Transformer attention layers. Our experiments with the 6-layer Transformer show significant BLEU improvements in both WMT 14 English-German / French tasks and the OPUS-100 many-to-many multilingual NMT task, and our deep Transformer experiments demonstrate the effectiveness of depth-wise LSTM on the convergence and performance of deep Transformers.
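The abstract describes replacing the usual residual connections with a depth-wise LSTM that runs across the stacked attention sub-layers, with layer normalization and the feed-forward computation absorbed into the LSTM cell. The PyTorch sketch below is a minimal illustration of that idea under our own assumptions; the module names, hyper-parameters, state initialization, and placement of normalization are illustrative choices, not the authors' exact formulation.

```python
import torch
import torch.nn as nn


class DepthWiseLSTMTransformerEncoder(nn.Module):
    """Minimal sketch: a depth-wise LSTM replaces residual connections
    between stacked self-attention layers. Each stacked layer is treated
    as one "time step" of the LSTM recurrence along the depth axis.
    Hyper-parameters and details here are assumptions for illustration."""

    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        # Pure attention sub-layers; the feed-forward transformation is
        # assumed to be absorbed into the depth-wise LSTM cell below.
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, nhead, batch_first=True)
             for _ in range(num_layers)]
        )
        # One LSTM cell shared across depth steps (an assumption; it could
        # equally be one cell per layer).
        self.depth_lstm = nn.LSTMCell(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model). Flatten tokens so the LSTM cell
        # runs independently at every position.
        b, t, d = x.shape
        h = x.reshape(b * t, d)        # hidden state carried across depth
        c = torch.zeros_like(h)        # cell state carried across depth
        for attn in self.attn_layers:
            inp = h.reshape(b, t, d)
            attn_out, _ = attn(inp, inp, inp)   # attention sub-layer output
            # Depth-wise LSTM step stands in for "residual add + FFN":
            h, c = self.depth_lstm(attn_out.reshape(b * t, d), (h, c))
        return self.norm(h.reshape(b, t, d))


if __name__ == "__main__":
    enc = DepthWiseLSTMTransformerEncoder()
    out = enc(torch.randn(2, 7, 512))
    print(out.shape)  # torch.Size([2, 7, 512])
```

The key design point the sketch tries to capture is that information from earlier layers is propagated selectively through the LSTM's gated hidden and cell states rather than being summed in via residual connections.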
