Paper Title

The NLP Task Effectiveness of Long-Range Transformers

Paper Authors

Guanghui Qin, Yukun Feng, Benjamin Van Durme

Paper Abstract

Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity. This has led to Transformer variants seeking to lower computational complexity, such as Longformer and Performer. While such models have theoretically greater efficiency, their effectiveness on real NLP tasks has not been well studied. We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets. We design experiments to isolate the effect of pretraining and hyperparameter settings, to focus on their capacity for long-range attention. Moreover, we present various methods to investigate attention behaviors to illuminate model details beyond metric scores. We find that the modified attention in long-range transformers has advantages in content selection and query-guided decoding, but it comes with previously unrecognized drawbacks such as insufficient attention to distant tokens and accumulated approximation error.
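
To illustrate the quadratic cost that motivates these variants, the sketch below contrasts a dense attention mask with a Longformer-style sliding-window mask. This is a minimal illustration assuming NumPy; the window size and mask construction are chosen for exposition and are not the configuration studied in the paper.

```python
import numpy as np

def full_attention_mask(n: int) -> np.ndarray:
    # Dense mask: every token may attend to every other token,
    # so the number of allowed (query, key) pairs grows as O(n^2).
    return np.ones((n, n), dtype=bool)

def sliding_window_mask(n: int, window: int = 2) -> np.ndarray:
    # Longformer-style local mask: token i attends only to tokens within
    # `window` positions, so allowed pairs grow as O(n * window).
    # The window size here is an illustrative assumption, not the paper's setting.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

if __name__ == "__main__":
    n = 8
    print("full attention pairs:  ", int(full_attention_mask(n).sum()))   # 64 = n^2
    print("sliding-window pairs:  ", int(sliding_window_mask(n).sum()))   # 34 with window=2
```

Either boolean mask can be applied to an attention score matrix before the softmax; the local mask keeps the cost linear in sequence length but, as the abstract notes, forgoes direct attention to distant tokens.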
