DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks

Paper Title

DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks

Authors

Luo, Ziyang; Xi, Yadong; Ma, Jing; Yang, Zhiwei; Mao, Xiaoxi; Fan, Changjie; Zhang, Rongsheng

Abstract

Since 2017, Transformer-based models have played a critical role in a wide range of downstream natural language processing tasks. However, a common limitation of the attention mechanism used in the Transformer encoder is that it cannot automatically capture word-order information, so explicit position embeddings generally have to be fed into the model. In contrast, a Transformer decoder with causal attention masks is naturally sensitive to word order. In this work, we focus on improving the position-encoding ability of BERT with causal attention masks. Furthermore, we propose a new pre-trained language model, DecBERT, and evaluate it on the GLUE benchmark. Experimental results show that (1) the causal attention mask is effective for BERT on language understanding tasks; (2) our DecBERT model without position embeddings achieves comparable performance on the GLUE benchmark; and (3) our modification accelerates the pre-training process, and DecBERT w/ PE achieves better overall performance than the baseline systems when pre-trained with the same amount of computational resources.
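The abstract's contrast between encoder attention and causally masked attention can be illustrated with a toy example. The sketch below is not the DecBERT architecture; it assumes a single attention head with identity projections (Q = K = V = the input), no position embeddings, and hypothetical helper names (`attention`, `close`). It checks whether shuffling the input tokens merely shuffles the outputs in the same way, which would mean the layer cannot tell one word order from another.

```python
import math


def attention(x, causal=False):
    """Toy single-head self-attention with identity projections (assumption: Q = K = V = x)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # With a causal mask, position i may only attend to positions 0..i.
        visible = range(i + 1) if causal else range(n)
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in visible]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * x[j][k] for w, j in zip(weights, visible))
                    for k in range(d)])
    return out


def close(u, v, tol=1e-9):
    """True if two vectors agree element-wise within tol."""
    return all(abs(a - b) < tol for a, b in zip(u, v))


# Three toy token vectors with no position information attached.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[p] for p in perm]

# Unmasked attention: does each token get the same output wherever it is placed?
full_a, full_b = attention(tokens), attention(shuffled)
full_equivariant = all(close(full_b[i], full_a[perm[i]]) for i in range(len(perm)))

# Causally masked attention: the same check.
causal_a, causal_b = attention(tokens, causal=True), attention(shuffled, causal=True)
causal_equivariant = all(close(causal_b[i], causal_a[perm[i]]) for i in range(len(perm)))

print("unmasked attention ignores token order:", full_equivariant)
print("causal attention ignores token order:", causal_equivariant)
```

In this toy setting the unmasked layer gives every token the same output regardless of where it sits in the sequence, while the causal mask makes each output depend on which tokens precede it. That is the sense in which a causally masked model is order-sensitive even without position embeddings.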

