Paper Title

GMAT: Global Memory Augmentation for Transformers

Paper Authors

Ankit Gupta, Jonathan Berant

Paper Abstract

Transformer-based models have become ubiquitous in natural language processing thanks to their large capacity, innate parallelism and high performance. The contextualizing component of a Transformer block is the $\textit{pairwise dot-product}$ attention that has a large $\Omega(L^2)$ memory requirement for length $L$ sequences, limiting its ability to process long documents. This has been the subject of substantial interest recently, where multiple approximations were proposed to reduce the quadratic memory requirement using sparse attention matrices. In this work, we propose to augment sparse Transformer blocks with a dense attention-based $\textit{global memory}$ of length $M$ ($\ll L$) which provides an aggregate global view of the entire input sequence to each position. Our augmentation has a manageable $O(M\cdot(L+M))$ memory overhead, and can be seamlessly integrated with prior sparse solutions. Moreover, global memory can also be used for sequence compression, by representing a long input sequence with the memory representations only. We empirically show that our method leads to substantial improvement on a range of tasks, including (a) synthetic tasks that require global reasoning, (b) masked language modeling, and (c) reading comprehension.
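To make the memory arithmetic in the abstract concrete, below is a minimal, hypothetical PyTorch sketch (not the paper's actual architecture) of one plausible single-head global-memory attention step: the $M$ memory slots attend densely over the full sequence plus memory, and each of the $L$ token positions then attends only to the memory, so the attention score matrices hold $O(M\cdot(L+M))$ entries rather than $\Omega(L^2)$. The function name `global_memory_attention`, the shared projection matrices, and the omission of the sparse local attention that GMAT is combined with are all simplifying assumptions.

```python
# Minimal sketch (not the authors' code) of a global-memory attention step,
# assuming single-head attention and omitting the sparse local attention
# that GMAT augments. With memory length M << sequence length L, the score
# matrices below have O(M*(L+M)) entries instead of O(L^2).
import torch
import torch.nn.functional as F

def global_memory_attention(x, mem, w_q, w_k, w_v):
    """x: (L, d) token states, mem: (M, d) global memory states."""
    d = x.shape[-1]
    scale = d ** -0.5

    # Memory reads the whole input plus itself: dense (M, M+L) attention scores.
    ctx = torch.cat([mem, x], dim=0)                          # (M+L, d)
    mem_scores = (mem @ w_q) @ (ctx @ w_k).T * scale          # (M, M+L)
    new_mem = F.softmax(mem_scores, dim=-1) @ (ctx @ w_v)     # (M, d)

    # Each input position reads only the updated memory: (L, M) scores,
    # giving every token an aggregate global view of the sequence.
    tok_scores = (x @ w_q) @ (new_mem @ w_k).T * scale        # (L, M)
    new_x = F.softmax(tok_scores, dim=-1) @ (new_mem @ w_v)   # (L, d)
    return new_x, new_mem

# Toy usage: L = 512 tokens, M = 16 memory slots, hidden size d = 64.
L, M, d = 512, 16, 64
x, mem = torch.randn(L, d), torch.randn(M, d)
w_q, w_k, w_v = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
out, out_mem = global_memory_attention(x, mem, w_q, w_k, w_v)
print(out.shape, out_mem.shape)  # torch.Size([512, 64]) torch.Size([16, 64])
```

In the full model described by the abstract, this memory pathway would be applied alongside a sparse attention pattern over the tokens themselves; the sketch isolates only the memory component to show where the $O(M\cdot(L+M))$ overhead comes from.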
