Paper Title
Sub-Linear Memory: How to Make Performers SLiM
Paper Authors
Paper Abstract
The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring $O(L^2)$ in serial time and memory as functions of input length $L$. Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation. We perform a thorough analysis of recent Transformer mechanisms with linear self-attention, Performers, in terms of overall computational complexity. We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory as a function of $L$ (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only $O(1)$ memory during training, and still requires $O(L)$ time. This discovered time-memory tradeoff can be used for training or, due to complete backward-compatibility, for fine-tuning on a low-memory device, e.g. a smartphone or an earlier-generation GPU, thus contributing towards decentralized and democratized deep learning.
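To illustrate the property the abstract describes, here is a minimal NumPy sketch (not the paper's released code) of a causal linear-attention forward pass in the Performer style. It assumes the queries and keys have already been mapped through a non-negative feature map (e.g. Performer random features); the function name and the epsilon constant are illustrative choices. Because the output at each position depends only on running prefix-sum statistics of size independent of the sequence length $L$, the forward pass needs only constant memory beyond the inputs while still taking $O(L)$ time, which is the extreme point of the time-memory tradeoff mentioned above.

```python
import numpy as np

def causal_linear_attention_streaming(Q, K, V):
    """Streaming forward pass of causal linear (Performer-style) attention.

    Q, K: (L, M) arrays, assumed already passed through a non-negative
          feature map (e.g. Performer random features).
    V:    (L, d) array of values.

    Only O(M * d) running statistics are kept, independent of L,
    illustrating the constant-memory forward pass at O(L) time.
    """
    L, M = Q.shape
    d = V.shape[1]
    S = np.zeros((M, d))   # running prefix sum of outer(k_i, v_i)
    z = np.zeros(M)        # running prefix sum of k_i (normalizer)
    out = np.empty((L, d))
    for i in range(L):
        S += np.outer(K[i], V[i])
        z += K[i]
        out[i] = (Q[i] @ S) / (Q[i] @ z + 1e-6)
    return out

# Illustrative usage with already-featurized, non-negative Q and K.
rng = np.random.default_rng(0)
L, M, d = 8, 16, 4
Q = rng.random((L, M))
K = rng.random((L, M))
V = rng.standard_normal((L, d))
print(causal_linear_attention_streaming(Q, K, V).shape)  # (8, 4)
```

The intermediate statistics `S` and `z` can also be recomputed in chunks during the backward pass rather than stored for every position, which is the mechanism behind the sublinear-memory training regime sketched in the abstract.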