Paper Title
Distilling Knowledge from Pre-trained Language Models via Text Smoothing
Paper Authors
Paper Abstract
This paper studies compressing pre-trained language models, like BERT (Devlin et al., 2019), via teacher-student knowledge distillation. Previous works usually force the student model to strictly mimic the smoothed labels predicted by the teacher BERT. As an alternative, we propose a new method for BERT distillation, i.e., asking the teacher to generate smoothed word ids, rather than labels, for teaching the student model in knowledge distillation. We call this kind of method TextSmoothing. Practically, we use the softmax prediction of the Masked Language Model (MLM) in BERT to generate word distributions for given texts and smooth those input texts using the predicted soft word ids. We assume that both the smoothed labels and the smoothed texts can implicitly augment the input corpus, while text smoothing is intuitively more efficient since it can generate more instances in one neural network forward step. Experimental results on GLUE and SQuAD demonstrate that our solution can achieve competitive results compared with existing BERT distillation methods.
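The abstract describes using the teacher's MLM softmax to turn each hard token into a distribution over the vocabulary, then feeding the student a "smoothed" input built from those soft word ids. The sketch below is a minimal illustration of that idea using the Hugging Face transformers library; the model name, the temperature of 1.0, and the use of `inputs_embeds` for the student are assumptions for illustration, not the paper's released implementation.

```python
# Minimal sketch of text smoothing as described in the abstract (assumptions noted):
# the teacher BERT's MLM softmax gives a word distribution at every position,
# and the smoothed input is the probability-weighted mixture of word embeddings
# rather than the original one-hot token ids.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
teacher = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

text = "the movie was surprisingly good"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Teacher MLM logits over the vocabulary at every input position.
    logits = teacher(**inputs).logits                 # (1, seq_len, vocab_size)
    probs = torch.softmax(logits / 1.0, dim=-1)       # soft word ids (temperature 1.0 assumed)

    # Smooth the text: replace each token with the probability-weighted
    # mixture of rows from the teacher's word embedding table.
    emb_table = teacher.bert.embeddings.word_embeddings.weight   # (vocab_size, hidden)
    smoothed_embeds = probs @ emb_table                           # (1, seq_len, hidden)

# A student encoder that accepts pre-computed embeddings (hypothetical here)
# can then be trained on the smoothed inputs, e.g.:
# student_out = student(inputs_embeds=smoothed_embeds,
#                       attention_mask=inputs["attention_mask"])
```

Since a single forward pass of the teacher yields a full distribution at every position, one smoothed input implicitly covers many discrete augmentations of the sentence, which is the efficiency argument made in the abstract.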