Paper Title
Sub-Linear Memory: How to Make Performers SLiM
Paper Authors
Paper Abstract
The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring $O(L^2)$ in serial time and memory as functions of input length $L$. Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation. We perform a thorough analysis of recent Transformer mechanisms with linear self-attention, Performers, in terms of overall computational complexity. We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory as a function of $L$ (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only $O(1)$ memory during training, and still requires $O(L)$ time. This discovered time-memory tradeoff can be used for training or, due to complete backward-compatibility, for fine-tuning on a low-memory device, e.g. a smartphone or an earlier-generation GPU, thus contributing towards decentralized and democratized deep learning.
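To illustrate the property the abstract describes, here is a minimal NumPy sketch (not the paper's released code) of a causal linear-attention forward pass in the Performer style. It assumes the queries and keys have already been mapped through a non-negative feature map (e.g. Performer random features); the function name and the epsilon constant are illustrative choices. Because the output at each position depends only on running prefix-sum statistics of size independent of the sequence length $L$, the forward pass needs only constant memory beyond the inputs while still taking $O(L)$ time, which is the extreme point of the time-memory tradeoff mentioned above.

```python
import numpy as np

def causal_linear_attention_streaming(Q, K, V):
    """Streaming forward pass of causal linear (Performer-style) attention.

    Q, K: (L, M) arrays, assumed already passed through a non-negative
          feature map (e.g. Performer random features).
    V:    (L, d) array of values.

    Only O(M * d) running statistics are kept, independent of L,
    illustrating the constant-memory forward pass at O(L) time.
    """
    L, M = Q.shape
    d = V.shape[1]
    S = np.zeros((M, d))   # running prefix sum of outer(k_i, v_i)
    z = np.zeros(M)        # running prefix sum of k_i (normalizer)
    out = np.empty((L, d))
    for i in range(L):
        S += np.outer(K[i], V[i])
        z += K[i]
        out[i] = (Q[i] @ S) / (Q[i] @ z + 1e-6)
    return out

# Illustrative usage with already-featurized, non-negative Q and K.
rng = np.random.default_rng(0)
L, M, d = 8, 16, 4
Q = rng.random((L, M))
K = rng.random((L, M))
V = rng.standard_normal((L, d))
print(causal_linear_attention_streaming(Q, K, V).shape)  # (8, 4)
```

The intermediate statistics `S` and `z` can also be recomputed in chunks during the backward pass rather than stored for every position, which is the mechanism behind the sublinear-memory training regime sketched in the abstract.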