Paper Title

Untangling tradeoffs between recurrence and self-attention in neural networks

Paper Authors

Giancarlo Kerg, Bhargav Kanuparthi, Anirudh Goyal, Kyle Goyette, Yoshua Bengio, Guillaume Lajoie

Paper Abstract

Attention and self-attention mechanisms are now central to state-of-the-art deep learning on sequential tasks. However, most recent progress hinges on heuristic approaches with limited understanding of attention's role in model optimization and computation, and relies on considerable memory and computational resources that scale poorly. In this work, we present a formal analysis of how self-attention affects gradient propagation in recurrent networks, and prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies by establishing concrete bounds for gradient norms. Building on these results, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. While providing guarantees to avoid vanishing gradients, we use simple numerical experiments to demonstrate the tradeoffs in performance and computational resources that arise from balancing attention and recurrence. Based on our results, we propose a concrete direction of research to improve scalability of attentive networks.
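
The abstract describes combining recurrence with self-attention over past hidden states so that gradients gain shortcut paths to distant time steps. Below is a minimal, illustrative sketch of that general idea in PyTorch; it is not the authors' implementation, and all names (`AttentiveRNN`, `memory`, the single-head projections) are assumptions made for illustration. The paper's relevancy screening mechanism would additionally prune `memory` to a small set of relevant states, which is not shown here.

```python
# Minimal sketch (assumed names, not the paper's code): an RNN cell whose
# update also attends over its own stored past hidden states, giving the
# backward pass a shortcut around the step-by-step recurrence chain.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveRNN(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        # Query/key/value projections for a single attention head.
        self.q = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, hidden_size, bias=False)
        self.hidden_size = hidden_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, batch, input_size)
        seq_len, batch, _ = x.shape
        h = x.new_zeros(batch, self.hidden_size)
        memory = []  # past hidden states kept for attention
        for t in range(seq_len):
            h = self.cell(x[t], h)
            if memory:
                past = torch.stack(memory, dim=1)              # (batch, t, hidden)
                scores = torch.einsum("bh,bth->bt", self.q(h), self.k(past))
                scores = scores / self.hidden_size ** 0.5
                weights = F.softmax(scores, dim=-1)            # attention over past states
                context = torch.einsum("bt,bth->bh", weights, self.v(past))
                # Shortcut path: gradients flow to distant steps through `context`
                # without traversing every intermediate recurrent transition.
                h = h + context
            memory.append(h)
        return h


if __name__ == "__main__":
    model = AttentiveRNN(input_size=8, hidden_size=16)
    x = torch.randn(50, 4, 8, requires_grad=True)
    model(x).sum().backward()
    # The gradient w.r.t. the very first input remains non-negligible,
    # illustrating how the attention shortcut counteracts vanishing gradients.
    print(x.grad[0].norm())
```

Keeping every past state, as this sketch does, is exactly the memory and compute cost the abstract says scales poorly; the proposed relevancy screening addresses this by retaining only a sparse subset of states.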
