Paper Title
ConvBERT: Improving BERT with Span-based Dynamic Convolution
Paper Authors
Paper Abstract
Pre-trained language models like BERT and its variants have recently achieved impressive performance on various natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and high computation cost. Although all its attention heads query the whole input sequence to generate the attention map from a global perspective, we observe that some heads only need to learn local dependencies, which implies computation redundancy. We therefore propose a novel span-based dynamic convolution to replace these self-attention heads and directly model local dependencies. The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments show that ConvBERT significantly outperforms BERT and its variants on various downstream tasks, with lower training cost and fewer model parameters. Remarkably, the ConvBERT_base model achieves an 86.4 GLUE score, 0.7 higher than ELECTRA_base, while using less than 1/4 of the training cost. Code and pre-trained models will be released.
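To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a mixed attention block in which some heads remain standard self-attention (global context) and the rest are span-based dynamic convolution heads (local context). This is an illustrative sketch based only on the abstract's description, not the authors' released implementation; names and settings such as SpanDynamicConvHead, MixedAttentionBlock, span_size, and the 6/6 head split are assumptions.

```python
# Sketch (assumption, not the official ConvBERT code) of a mixed attention block:
# standard self-attention heads for global context plus span-based dynamic
# convolution heads whose kernels are generated from a local span of the input.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpanDynamicConvHead(nn.Module):
    """Convolution head: a per-position kernel is generated from a local span
    of the input and applied to the value projections of that same span."""

    def __init__(self, hidden_size: int, head_dim: int, span_size: int = 9):
        super().__init__()
        assert span_size % 2 == 1, "use an odd span for symmetric padding"
        self.span_size = span_size
        # Span-aware feature: mixes each token with its neighborhood before
        # the kernel is generated (the "span-based" part of the idea).
        self.span_key = nn.Conv1d(hidden_size, head_dim,
                                  kernel_size=span_size, padding=span_size // 2)
        self.value = nn.Linear(hidden_size, head_dim)
        self.kernel_gen = nn.Linear(head_dim, span_size)

    def forward(self, x):                                   # x: (B, N, hidden)
        v = self.value(x)                                   # (B, N, d)
        span_feat = self.span_key(x.transpose(1, 2)).transpose(1, 2)  # (B, N, d)
        kernel = F.softmax(self.kernel_gen(span_feat), dim=-1)        # (B, N, k)
        pad = self.span_size // 2
        v_pad = F.pad(v, (0, 0, pad, pad))                  # pad the sequence dim
        windows = v_pad.unfold(1, self.span_size, 1)        # (B, N, d, k)
        # Dynamic convolution: per-token weighted sum over the local span.
        return torch.einsum("bndk,bnk->bnd", windows, kernel)


class MixedAttentionBlock(nn.Module):
    """Some heads stay as standard self-attention; the others are replaced by
    span-based dynamic convolution heads, as described in the abstract."""

    def __init__(self, hidden_size=768, num_heads=12, conv_heads=6, span_size=9):
        super().__init__()
        head_dim = hidden_size // num_heads
        attn_heads = num_heads - conv_heads
        attn_dim = attn_heads * head_dim
        self.attn_in = nn.Linear(hidden_size, attn_dim)
        self.attn = nn.MultiheadAttention(attn_dim, attn_heads, batch_first=True)
        self.convs = nn.ModuleList(
            [SpanDynamicConvHead(hidden_size, head_dim, span_size)
             for _ in range(conv_heads)])
        self.out = nn.Linear(attn_dim + conv_heads * head_dim, hidden_size)

    def forward(self, x):                                   # x: (B, N, hidden)
        a = self.attn_in(x)
        attn_out, _ = self.attn(a, a, a, need_weights=False)
        conv_out = torch.cat([head(x) for head in self.convs], dim=-1)
        return self.out(torch.cat([attn_out, conv_out], dim=-1))


if __name__ == "__main__":
    x = torch.randn(2, 16, 768)                             # (batch, seq, hidden)
    print(MixedAttentionBlock()(x).shape)                   # torch.Size([2, 16, 768])
```

The head split, span size, and output projection here are placeholder choices made only for the sketch; the point is the structure the abstract describes, where local dependencies are handled by dynamic convolution heads instead of full-sequence attention.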