Paper Title
B2T Connection: Serving Stability and Performance in Deep Transformers
Paper Authors
Paper Abstract
From the perspective of the layer normalization (LN) positions, the architectures of Transformers can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to be Pre-LN because, in Post-LN with deep Transformers (e.g., those with ten or more layers), training is often unstable, resulting in useless models. However, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers (e.g., those with six or fewer layers). This study first investigates the reason for these discrepant observations empirically and theoretically and makes the following discoveries: (1) the LN in Post-LN is the main source of the vanishing gradient problem that leads to unstable training, whereas Pre-LN prevents it, and (2) Post-LN tends to preserve larger gradient norms in higher layers during back-propagation, which may lead to effective training. Exploiting these findings, we propose a method that can provide both high stability and effective training through a simple modification of Post-LN. We conduct experiments on a wide range of text generation tasks. The experimental results demonstrate that our method outperforms Pre-LN and enables stable training regardless of shallow or deep layer settings. Our code is publicly available at https://github.com/takase/b2t_connection.
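To make the architectural distinction concrete, below is a minimal PyTorch sketch (not the authors' implementation; the official code is in the repository linked above) contrasting a Post-LN and a Pre-LN Transformer layer. The b2t flag and the exact placement of the extra connection are an assumption based on the abstract's description of a "simple modification of Post-LN", read here as an additional connection from the layer input to just before the final LN; details may differ from the paper.

import torch
import torch.nn as nn


class PostLNLayer(nn.Module):
    """Post-LN: LayerNorm is applied after each residual addition."""

    def __init__(self, d_model=512, nhead=8, d_ff=2048, b2t=False):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.b2t = b2t  # hypothetical flag for the extra bottom-to-top connection

    def forward(self, x):
        # Self-attention sub-layer: residual add first, then LN (Post-LN order).
        h = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        # Feed-forward sub-layer.
        out = h + self.ffn(h)
        if self.b2t:
            # Assumed B2T-style connection: the layer input x is added back in
            # just before the final LN, bypassing the intermediate LN.
            out = out + x
        return self.ln2(out)


class PreLNLayer(nn.Module):
    """Pre-LN: LayerNorm is applied before each sub-layer, inside the residual branch."""

    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)  # (batch, sequence length, model dimension)
    print(PostLNLayer()(x).shape, PreLNLayer()(x).shape, PostLNLayer(b2t=True)(x).shape)

With b2t=True, the layer input reaches the top of the layer through a path that skips the intermediate LN, which matches the abstract's motivation: avoid the vanishing gradients attributed to the LNs in Post-LN while keeping the Post-LN-style normalization that preserves larger gradient norms in higher layers.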