Paper Title
Incorporating BERT into Parallel Sequence Decoding with Adapters
Paper Authors
Paper Abstract
While large scale pre-trained language models such as BERT have achieved great success on various natural language understanding tasks, how to efficiently and effectively incorporate them into sequence-to-sequence models and the corresponding text generation tasks remains a non-trivial problem. In this paper, we propose to address this problem by taking two different BERT models as the encoder and decoder respectively, and fine-tuning them by introducing simple and lightweight adapter modules, which are inserted between BERT layers and tuned on the task-specific dataset. In this way, we obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models, while bypassing the catastrophic forgetting problem. Each component in the framework can be considered as a plug-in unit, making the framework flexible and task agnostic. Our framework is based on a parallel sequence decoding algorithm named Mask-Predict considering the bi-directional and conditional independent nature of BERT, and can be adapted to traditional autoregressive decoding easily. We conduct extensive experiments on neural machine translation tasks where the proposed method consistently outperforms autoregressive baselines while reducing the inference latency by half, and achieves $36.49$/$33.57$ BLEU scores on IWSLT14 German-English/WMT14 German-English translation. When adapted to autoregressive decoding, the proposed method achieves $30.60$/$43.56$ BLEU scores on WMT14 English-German/English-French translation, on par with the state-of-the-art baseline models.
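The abstract describes inserting lightweight adapter modules between frozen BERT layers so that only the adapters are tuned on the task-specific data. The following is a minimal sketch of that idea, not the authors' implementation; the module names (`Adapter`, `BertLayerWithAdapter`) and the bottleneck size are illustrative assumptions.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Lightweight bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):  # bottleneck size is an assumption
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the pre-trained representation,
        # which is what lets the frozen BERT sidestep catastrophic forgetting.
        return self.norm(hidden_states + self.up(self.act(self.down(hidden_states))))


class BertLayerWithAdapter(nn.Module):
    """Wraps a frozen pre-trained BERT layer and inserts a trainable adapter after it."""

    def __init__(self, bert_layer: nn.Module, hidden_size: int):
        super().__init__()
        self.bert_layer = bert_layer
        for p in self.bert_layer.parameters():
            p.requires_grad = False  # pre-trained weights stay fixed; only adapters are fine-tuned
        self.adapter = Adapter(hidden_size)

    def forward(self, hidden_states, *args, **kwargs):
        layer_out = self.bert_layer(hidden_states, *args, **kwargs)
        # HuggingFace-style layers return a tuple whose first element is the hidden states.
        if isinstance(layer_out, tuple):
            return (self.adapter(layer_out[0]),) + layer_out[1:]
        return self.adapter(layer_out)
```

In this sketch the same wrapper would be applied to every layer of both the encoder-side and decoder-side BERT, so the plug-in nature of the framework corresponds to swapping which pre-trained model is wrapped and which adapter set is trained.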