Paper Title
B2T Connection: Serving Stability and Performance in Deep Transformers
Paper Authors
Paper Abstract
From the perspective of the layer normalization (LN) positions, the architectures of Transformers can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to be Pre-LN because, in Post-LN with deep Transformers (e.g., those with ten or more layers), training is often unstable, resulting in useless models. However, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers (e.g., those with six or fewer layers). This study first investigates the reason for these discrepant observations empirically and theoretically and makes the following discoveries: (1) the LN in Post-LN is the main source of the vanishing gradient problem that leads to unstable training, whereas Pre-LN prevents it, and (2) Post-LN tends to preserve larger gradient norms in higher layers during back-propagation, which may lead to effective training. Exploiting these findings, we propose a method that can provide both high stability and effective training through a simple modification of Post-LN. We conduct experiments on a wide range of text generation tasks. The experimental results demonstrate that our method outperforms Pre-LN and enables stable training regardless of shallow or deep layer settings. Our code is publicly available at https://github.com/takase/b2t_connection.
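To make the architectural distinction concrete, below is a minimal PyTorch sketch (not the authors' implementation; the official code is in the repository linked above) contrasting a Post-LN and a Pre-LN Transformer layer. The b2t flag and the exact placement of the extra connection are an assumption based on the abstract's description of a "simple modification of Post-LN", read here as an additional connection from the layer input to just before the final LN; details may differ from the paper.

import torch
import torch.nn as nn


class PostLNLayer(nn.Module):
    """Post-LN: LayerNorm is applied after each residual addition."""

    def __init__(self, d_model=512, nhead=8, d_ff=2048, b2t=False):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.b2t = b2t  # hypothetical flag for the extra bottom-to-top connection

    def forward(self, x):
        # Self-attention sub-layer: residual add first, then LN (Post-LN order).
        h = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        # Feed-forward sub-layer.
        out = h + self.ffn(h)
        if self.b2t:
            # Assumed B2T-style connection: the layer input x is added back in
            # just before the final LN, bypassing the intermediate LN.
            out = out + x
        return self.ln2(out)


class PreLNLayer(nn.Module):
    """Pre-LN: LayerNorm is applied before each sub-layer, inside the residual branch."""

    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)  # (batch, sequence length, model dimension)
    print(PostLNLayer()(x).shape, PreLNLayer()(x).shape, PostLNLayer(b2t=True)(x).shape)

With b2t=True, the layer input reaches the top of the layer through a path that skips the intermediate LN, which matches the abstract's motivation: avoid the vanishing gradients attributed to the LNs in Post-LN while keeping the Post-LN-style normalization that preserves larger gradient norms in higher layers.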