Paper Title
The Depth-to-Width Interplay in Self-Attention
Paper Authors
Paper Abstract
Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: previous works indicate that increasing the internal representation (network width) is just as useful as increasing the number of self-attention layers (network depth). We theoretically predict a width-dependent transition between depth-efficiency and depth-inefficiency in self-attention. We conduct systematic empirical ablations on networks of depths 6 to 48 that clearly reveal the theoretically predicted behaviors, and provide explicit quantitative suggestions regarding the optimal depth-to-width allocation for a given self-attention network size. The race towards beyond 1-Trillion parameter language models renders informed guidelines for increasing self-attention depth and width in tandem an essential ingredient. Our guidelines elucidate the depth-to-width trade-off in self-attention networks of sizes up to the scale of GPT3 (which we project to be too deep for its size), and beyond, marking an unprecedented width of 30K as optimal for a 1-Trillion parameter network.
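To make the abstract's sizing claims easier to check, below is a minimal sketch of the arithmetic behind depth-to-width allocation. It assumes the standard non-embedding parameter-count approximation for Transformers, params ≈ 12 · depth · width² (attention projections plus feed-forward blocks); this is a generic approximation, not the paper's fitted allocation rule, and the function name `implied_depth` is hypothetical.

```python
# A minimal sketch, assuming the generic Transformer approximation
# params ~= 12 * depth * width**2 (non-embedding parameters).
# Given a parameter budget and a candidate width, it solves for the implied
# depth, so allocations such as the ~30K width quoted for a 1-Trillion
# parameter model can be sanity-checked.

def implied_depth(param_budget: float, width: int) -> float:
    """Depth implied by the approximation params ~= 12 * depth * width**2."""
    return param_budget / (12 * width ** 2)

if __name__ == "__main__":
    # GPT3-scale budget (~175B parameters) at its published width of 12288:
    print(f"175B budget, width 12288 -> depth ~ {implied_depth(175e9, 12288):.0f}")  # ~97 layers
    # A 1-Trillion parameter budget at the 30K width highlighted in the abstract:
    print(f"1T budget, width 30000 -> depth ~ {implied_depth(1e12, 30000):.0f}")     # ~93 layers
```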