Paper Title
Rethinking Attention with Performers
Paper Authors
Paper Abstract
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
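To make the abstract's core claim concrete, below is a minimal NumPy sketch of the FAVOR+ idea: approximate the softmax kernel with positive random features so that attention can be computed by associativity in linear (rather than quadratic) time and memory in the sequence length. This is not the authors' implementation; the function names (e.g. `performer_style_attention`), the use of plain Gaussian rather than orthogonal features, and the toy feature count are illustrative assumptions.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Exact softmax attention; O(L^2) time and memory in the sequence length L."""
    d = Q.shape[-1]
    A = np.exp(Q @ K.T / np.sqrt(d))              # L x L attention matrix
    return (A @ V) / A.sum(axis=-1, keepdims=True)

def positive_random_features(X, W):
    """Positive random features for the softmax kernel, in the spirit of FAVOR+:
    phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m).  A constant factor is subtracted
    for numerical stability; it cancels in the normalized attention below."""
    m = W.shape[0]
    proj = X @ W.T - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)
    return np.exp(proj - proj.max()) / np.sqrt(m)

def performer_style_attention(Q, K, V, num_features=1024, seed=0):
    """Linear-complexity approximation: never materializes the L x L matrix."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, d))    # plain Gaussian features here;
                                                  # the paper orthogonalizes them to reduce variance
    Qp = positive_random_features(Q / d ** 0.25, W)   # scaling so phi(q)^T phi(k) estimates exp(q^T k / sqrt(d))
    Kp = positive_random_features(K / d ** 0.25, W)
    KV = Kp.T @ V                                 # m x d_v, computed once
    numer = Qp @ KV                               # (Qp Kp^T) V via associativity, in O(L m d)
    denom = Qp @ Kp.sum(axis=0)                   # row sums of the implicit attention matrix
    return numer / denom[:, None]

# Sanity check: the estimate should track exact softmax attention on random inputs.
L, d = 128, 16
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
print(np.max(np.abs(softmax_attention(Q, K, V) - performer_style_attention(Q, K, V))))
```

The key step is reassociating the product as Qp (Kp^T V) instead of (Qp Kp^T) V, so the L x L attention matrix is never formed and cost grows linearly in the sequence length.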