Paper Title
LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding
Paper Authors
Paper Abstract
Pre-trained models such as BERT have achieved great results on various natural language processing problems. However, their large number of parameters requires significant memory and inference time, which makes them difficult to deploy on edge devices. In this work, we propose LRC-BERT, a knowledge distillation method based on contrastive learning that fits the output of the intermediate layer from the angular distance aspect, which is not considered by existing distillation methods. Furthermore, we introduce a gradient perturbation-based training architecture in the training phase to increase the robustness of LRC-BERT, which is the first such attempt in knowledge distillation. Additionally, to better capture the distribution characteristics of the intermediate layer, we design a two-stage training method for the total distillation loss. Finally, by evaluating on 8 datasets from the General Language Understanding Evaluation (GLUE) benchmark, the proposed LRC-BERT outperforms existing state-of-the-art methods, which demonstrates the effectiveness of our approach.
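To make the core idea concrete, below is a minimal, illustrative sketch of a contrastive distillation loss on intermediate-layer representations in which similarity is measured by cosine (i.e., angular) similarity, so the student is pulled toward the matching teacher representation and pushed away from other samples in the batch. This is not the paper's exact LRC-BERT objective; the per-sample pooling, the layer mapping, the projection dimensions, and the temperature value are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def angular_contrastive_distillation_loss(student_hidden, teacher_hidden, temperature=0.1):
    """Illustrative contrastive loss on intermediate-layer outputs.

    student_hidden, teacher_hidden: (batch_size, hidden_dim) pooled
    representations from a mapped pair of student/teacher layers.
    The teacher representation of the same sample is the positive;
    teacher representations of other in-batch samples are negatives.
    """
    # Normalize so the dot product equals cosine similarity (angle-based distance).
    s = F.normalize(student_hidden, dim=-1)
    t = F.normalize(teacher_hidden, dim=-1)

    # Pairwise cosine similarities between all student/teacher sample pairs.
    logits = s @ t.t() / temperature  # shape: (batch, batch)

    # Diagonal entries correspond to the positive (same-sample) pairs.
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Hypothetical dimensions: a small student hidden size projected up to
    # BERT-base's 768-dimensional teacher hidden size.
    student_out = torch.randn(8, 312)
    teacher_out = torch.randn(8, 768)
    proj = torch.nn.Linear(312, 768)
    loss = angular_contrastive_distillation_loss(proj(student_out), teacher_out)
    print(loss.item())
```

In practice, such a loss would be applied layer by layer between mapped student and teacher layers and combined with the other distillation terms described in the paper.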