Paper Title

Efficient Content-Based Sparse Attention with Routing Transformers

Authors

Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier

Abstract

Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches to reduce this complexity focused on attending to local sliding windows or a small set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention to $O\left(n^{1.5}d\right)$ from $O\left(n^2d\right)$ for sequence length $n$ and hidden dimension $d$. We show that our model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs 18.3 perplexity) as well as on image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while using fewer self-attention layers. Additionally, we set a new state-of-the-art on the newly released PG-19 data-set, obtaining a test perplexity of 33.2 with a 22 layer Routing Transformer model trained on sequences of length 8192.
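As a rough illustration of the routing idea described in the abstract, below is a minimal NumPy sketch of clustering-based sparse attention: queries and keys are assigned to roughly sqrt(n) centroids, and attention is computed only among positions that share a cluster. The function name routing_attention, the use of plain batch k-means in place of the paper's online k-means updates, and the omission of causal masking and balanced cluster sizes are simplifying assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def routing_attention(Q, K, V, n_clusters=None, kmeans_iters=10, seed=0):
    """Sketch of content-based sparse attention via k-means routing.

    Queries and keys are routed to the nearest of ~sqrt(n) centroids and
    attention is restricted to within-cluster pairs. This is an
    illustrative approximation only (no causal mask, no balanced
    cluster assignment, batch rather than online k-means).
    """
    n, d = Q.shape
    if n_clusters is None:
        n_clusters = max(1, int(np.sqrt(n)))
    rng = np.random.default_rng(seed)

    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-6)

    # Initialize centroids from randomly chosen normalized queries/keys.
    X = unit(np.concatenate([Q, K], axis=0))
    centroids = X[rng.choice(len(X), n_clusters, replace=False)]

    # Plain batch k-means on the unit sphere (the paper uses online updates).
    for _ in range(kmeans_iters):
        assign = np.argmax(X @ centroids.T, axis=-1)
        for c in range(n_clusters):
            members = X[assign == c]
            if len(members) > 0:
                centroids[c] = unit(members.mean(axis=0))

    # Route queries and keys to their nearest centroid.
    q_assign = np.argmax(unit(Q) @ centroids.T, axis=-1)
    k_assign = np.argmax(unit(K) @ centroids.T, axis=-1)

    out = np.zeros_like(Q)
    for c in range(n_clusters):
        q_idx = np.where(q_assign == c)[0]
        k_idx = np.where(k_assign == c)[0]
        if len(q_idx) == 0 or len(k_idx) == 0:
            continue
        # Standard scaled dot-product attention, restricted to cluster c.
        scores = Q[q_idx] @ K[k_idx].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[q_idx] = weights @ V[k_idx]
    return out
```

With about sqrt(n) clusters of roughly sqrt(n) positions each, every query attends to approximately sqrt(n) keys, which is where the O(n^{1.5} d) cost quoted in the abstract comes from, compared with O(n^2 d) for full attention.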
