Paper Title

Multi-branch Attentive Transformer

Paper Authors

Fan, Yang; Xie, Shufang; Xia, Yingce; Wu, Lijun; Qin, Tao; Li, Xiang-Yang; Liu, Tie-Yan

Paper Abstract

While the multi-branch architecture is one of the key ingredients to the success of computer vision tasks, it has not been well investigated in natural language processing, especially sequence learning tasks. In this work, we propose a simple yet effective variant of Transformer called multi-branch attentive Transformer (briefly, MAT), where the attention layer is the average of multiple branches and each branch is an independent multi-head attention layer. We leverage two training techniques to regularize the training: drop-branch, which randomly drops individual branches during training, and proximal initialization, which uses a pre-trained Transformer model to initialize multiple branches. Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements. Our code is available at https://github.com/HA-Transformer.
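
To make the described layer concrete, the sketch below shows one way the branch averaging and drop-branch regularization could be implemented in PyTorch. It is a reading of the abstract, not the authors' released implementation (that lives in the linked repository): the class name MultiBranchAttention, the use of nn.MultiheadAttention for each branch, the p_drop parameter, and the rescaling convention for drop-branch are all illustrative assumptions.

import torch
import torch.nn as nn


class MultiBranchAttention(nn.Module):
    """Minimal sketch: output is the average of independent MHA branches."""

    def __init__(self, d_model, n_heads, n_branches, p_drop=0.3):
        super().__init__()
        # Each branch is an independent multi-head attention layer.
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads) for _ in range(n_branches)
        )
        self.p_drop = p_drop

    def forward(self, query, key, value):
        # Stack branch outputs: (n_branches, seq_len, batch, d_model).
        outs = torch.stack(
            [branch(query, key, value)[0] for branch in self.branches], dim=0
        )
        if self.training and self.p_drop > 0:
            # Drop-branch (assumed convention): keep each branch with
            # probability 1 - p_drop and average over the survivors. clamp()
            # guards against division by zero when every branch is dropped;
            # the layer then outputs zeros, which a residual connection absorbs.
            keep = torch.rand(outs.size(0), device=outs.device) >= self.p_drop
            keep = keep.to(outs.dtype).view(-1, 1, 1, 1)
            return (outs * keep).sum(dim=0) / keep.sum().clamp(min=1.0)
        # Inference: plain average over all branches.
        return outs.mean(dim=0)

Under this reading, proximal initialization would amount to loading a pre-trained Transformer attention layer's weights into every element of self.branches before fine-tuning, so that all branches start from the same pre-trained point and diverge only through drop-branch and subsequent training.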
