Paper Title

Multi-Head Attention: Collaborate Instead of Concatenate

Authors

Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi

Abstract

Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. Training very large transformer models has allowed significant improvements in both fields, but once trained, these networks show symptoms of over-parameterization. For instance, it is known that many attention heads can be pruned without impacting accuracy. This work aims to enhance the current understanding of how multiple heads interact. Motivated by the observation that attention heads learn redundant key/query projections, we propose a collaborative multi-head attention layer that enables heads to learn shared projections. Our scheme decreases the number of parameters in an attention layer and can be used as a drop-in replacement in any transformer architecture. Our experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation, and vision. We also show that it is possible to re-parametrize a pre-trained multi-head attention layer into our collaborative attention layer. Collaborative multi-head attention reduces the size of the key and query projections by a factor of 4 at the same accuracy and speed. Our code is public.
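
The abstract describes heads sharing their key/query projections, but does not spell out the mechanism. Below is a minimal PyTorch sketch of one way such sharing could look: all heads use a single key/query projection of size shared_dim, and each head re-weights those shared dimensions with a learned mixing vector, so shared_dim can be set smaller than num_heads * head_dim. The class name CollaborativeSelfAttention, the mixing-vector formulation, and all parameter names are illustrative assumptions based on the abstract, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CollaborativeSelfAttention(nn.Module):
    """Sketch: self-attention where all heads share one key/query projection.

    Each head re-weights the shared key/query dimensions with a learned
    mixing vector instead of owning its own projection matrices, so the
    total key/query size can be smaller than num_heads * head_dim.
    (Illustrative assumption; not the authors' released code.)
    """

    def __init__(self, embed_dim: int, num_heads: int, shared_dim: int):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.shared_dim = shared_dim

        # Shared key/query projections (one pair of matrices for all heads).
        self.w_q = nn.Linear(embed_dim, shared_dim, bias=False)
        self.w_k = nn.Linear(embed_dim, shared_dim, bias=False)
        # Per-head mixing vectors over the shared key/query dimensions.
        self.mixing = nn.Parameter(torch.ones(num_heads, shared_dim))
        # Value and output projections keep the standard per-head layout.
        self.w_v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.w_o = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape
        q = self.w_q(x)  # (B, T, shared_dim), shared by every head
        k = self.w_k(x)  # (B, T, shared_dim), shared by every head
        v = self.w_v(x).view(batch, seq_len, self.num_heads, self.head_dim)

        # Give each head its own re-weighted view of the shared query dims.
        q_h = q.unsqueeze(1) * self.mixing.view(1, self.num_heads, 1, self.shared_dim)
        k_h = k.unsqueeze(1)  # broadcast the shared keys across heads

        scores = q_h @ k_h.transpose(-2, -1) / self.shared_dim ** 0.5  # (B, H, T, T)
        attn = F.softmax(scores, dim=-1)

        out = attn @ v.transpose(1, 2)                 # (B, H, T, head_dim)
        out = out.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.w_o(out)
```

As a usage note, instantiating the layer with shared_dim equal to a quarter of num_heads * head_dim (for example, 192 instead of 768 in a 12-head model with embed_dim 768) would mirror the 4x reduction in key/query projection size mentioned in the abstract.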
