Paper Title
Longformer: The Long-Document Transformer
Paper Authors
Paper Abstract
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
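The following is a minimal sketch (not the authors' implementation) of the sliding-window plus global attention pattern described above, written in NumPy for clarity. The window size `w`, the choice of `global_idx`, and all tensor shapes are illustrative assumptions; the actual Longformer uses dilated windows and custom GPU kernels, and handles batching and multiple heads.

```python
import numpy as np

def longformer_style_attention(q, k, v, w=2, global_idx=(0,)):
    """Restrict each query to a +/- w local window of keys, plus a small set of
    'global' positions that every token attends to and that attend to every token.
    Per-token cost is O(2w + 1 + g), so total cost grows linearly in sequence
    length n instead of quadratically."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v)
    global_set = set(global_idx)
    for i in range(n):
        if i in global_set:
            # Global tokens attend to the full sequence.
            cols = np.arange(n)
        else:
            # Local window around position i, plus the global positions.
            window = set(range(max(0, i - w), min(n, i + w + 1)))
            cols = np.array(sorted(window | global_set))
        scores = (q[i] @ k[cols].T) * scale       # scores over allowed keys only
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                      # softmax restricted to allowed keys
        out[i] = probs @ v[cols]
    return out

# Toy usage: 16 tokens, 8-dim vectors, token 0 treated as global (e.g. a [CLS] token).
rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
print(longformer_style_attention(q, k, v, w=2, global_idx=(0,)).shape)  # (16, 8)
```

Because each row of the attention pattern touches only a fixed number of key positions, the memory and compute footprint scales linearly with sequence length, which is the property the abstract contrasts with standard quadratic self-attention.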