Paper Title
Treeformer: Dense Gradient Trees for Efficient Attention Computation
Paper Authors
Paper Abstract
Standard inference and training with transformer-based architectures scale quadratically with input sequence length. This is prohibitively expensive for a variety of applications, especially web-page translation, query-answering, etc. Consequently, several approaches have been developed recently to speed up attention computation by enforcing different attention structures, such as sparsity, low rank, or approximating attention using kernels. In this work, we view attention computation as nearest neighbor retrieval, and use decision-tree-based hierarchical navigation to reduce the retrieval cost per query token from linear in sequence length to nearly logarithmic. Based on such hierarchical navigation, we design Treeformer, which can use one of two efficient attention layers -- TF-Attention and TC-Attention. TF-Attention computes the attention in a fine-grained style, while TC-Attention is a coarse attention layer which also ensures that the gradients are "dense". To optimize such challenging discrete layers, we propose a two-level bootstrapped training method. Using extensive experiments on standard NLP benchmarks, especially for long sequences, we demonstrate that our Treeformer architecture can be almost as accurate as the baseline Transformer while using 30x fewer FLOPs in the attention layer. Compared to Linformer, accuracy can be as much as 12% higher while using similar FLOPs in the attention layer.
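To make the retrieval view concrete, here is a minimal sketch (not the authors' implementation) of how a learned binary decision tree can route each query token to a leaf and restrict attention to the keys routed to the same leaf, in the spirit of TF-Attention. The heap-ordered hyperplane parameterisation (`split_w`, `split_b`), the function names, the single head, and the omission of causal masking and of the two-level bootstrapped training are all simplifying assumptions made for illustration.

```python
import numpy as np

def route_to_leaf(x, split_w, split_b, depth):
    """Send a single query/key vector down a balanced binary decision tree.

    Internal nodes are stored in heap order (children of node i are 2i+1 and
    2i+2); each node holds one hyperplane (split_w[i], split_b[i]). This
    parameterisation is an illustrative assumption, not the paper's exact one.
    """
    node = 0
    for _ in range(depth):
        go_right = float(x @ split_w[node] + split_b[node]) > 0.0
        node = 2 * node + 1 + int(go_right)
    return node - (2 ** depth - 1)  # leaf index in [0, 2**depth)

def tf_attention_sketch(Q, K, V, split_w, split_b, depth):
    """Fine-grained tree-attention sketch: each query attends only to the
    keys that landed in the same leaf, instead of all n keys."""
    n, d_model = Q.shape
    key_leaf = np.array([route_to_leaf(k, split_w, split_b, depth) for k in K])
    out = np.zeros_like(V)
    for i in range(n):
        leaf = route_to_leaf(Q[i], split_w, split_b, depth)
        idx = np.where(key_leaf == leaf)[0]      # O(leaf size) keys, not O(n)
        if idx.size == 0:
            continue                             # empty leaf: query gets no context
        scores = (Q[i] @ K[idx].T) / np.sqrt(d_model)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        out[i] = probs @ V[idx]
    return out

# Toy usage: 128 tokens, model width 64, a depth-3 tree (8 leaves).
rng = np.random.default_rng(0)
n, d_model, depth = 128, 64, 3
Q, K, V = rng.normal(size=(3, n, d_model))
split_w = rng.normal(size=(2 ** depth - 1, d_model))
split_b = np.zeros(2 ** depth - 1)
out = tf_attention_sketch(Q, K, V, split_w, split_b, depth)
print(out.shape)  # (128, 64)
```

Under these assumptions, the per-query cost drops from O(n * d) for full attention to roughly O(depth * d) for tree routing plus O(leaf_size * d) for attention within the leaf, which is the nearly logarithmic retrieval cost the abstract refers to when leaves stay small. The hard routing decisions are what make the layer discrete and motivate the bootstrapped training procedure described in the abstract.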