Paper Title
Big Bird: Transformers for Longer Sequences
Paper Authors
Paper Abstract
Transformer-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having $O(1)$ global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences up to 8x longer than was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
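The sparse attention pattern the abstract describes combines a handful of global tokens (such as CLS) that attend to the whole sequence, a sliding window over neighboring positions, and a few random connections per query. Below is a minimal NumPy sketch of such an attention mask, not the authors' implementation; the function name and parameters (`bigbird_attention_mask`, `num_global`, `window`, `num_random`) are illustrative assumptions.

```python
# Minimal sketch of a BigBird-style sparse attention mask:
# global tokens + sliding window + random connections per query.
import numpy as np

def bigbird_attention_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Return a boolean (seq_len, seq_len) mask; True means query i may attend key j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Global tokens (e.g. CLS): attend to everything and are attended by everything.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    for i in range(seq_len):
        # Sliding window: each token attends to its local neighborhood.
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
        # Random connections: a few extra keys sampled per query.
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

mask = bigbird_attention_mask(seq_len=16)
# Each query attends to O(num_global + window + num_random) keys, so the total
# number of attended pairs grows linearly in seq_len rather than quadratically.
print(mask.sum(axis=1))
```

In the sketch, the per-query key count stays a small constant as `seq_len` grows, which is the linear-memory property the abstract contrasts with full quadratic attention.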