Paper Title

Contrastive Distillation on Intermediate Representations for Language Model Compression

Paper Authors

Siqi Sun, Zhe Gan, Yu Cheng, Yuwei Fang, Shuohang Wang, Jingjing Liu

Paper Abstract

Existing language model compression methods mostly use a simple L2 loss to distill knowledge in the intermediate representations of a large BERT model to a smaller one. Although widely used, this objective by design assumes that all the dimensions of hidden representations are independent, failing to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CoDIR), a principled knowledge distillation framework where the student is trained to distill knowledge through intermediate layers of the teacher via a contrastive objective. By learning to distinguish positive samples from a large set of negative samples, CoDIR facilitates the student's exploitation of rich information in the teacher's hidden layers. CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark, outperforming state-of-the-art compression methods.
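The contrastive objective described in the abstract is in the spirit of an InfoNCE-style loss: the student's intermediate representation should score higher similarity with the teacher's representation of the same input (positive) than with teacher representations of other inputs (negatives). The sketch below is a minimal, illustrative version of such a loss, not the authors' exact implementation; it assumes mean-pooled layer representations already projected to a common dimension, and the function name, negative-sampling scheme, and temperature value are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_hidden, teacher_pos, teacher_negs, temperature=0.1):
    """InfoNCE-style contrastive loss between student and teacher intermediate representations.

    student_hidden: (batch, dim)        pooled student-layer representation
    teacher_pos:    (batch, dim)        teacher representation of the same input (positive)
    teacher_negs:   (batch, n_neg, dim) teacher representations of other inputs (negatives)

    Note: shapes, pooling, and the temperature are illustrative assumptions.
    """
    # Normalize so dot products are cosine similarities.
    s = F.normalize(student_hidden, dim=-1)
    t_pos = F.normalize(teacher_pos, dim=-1)
    t_neg = F.normalize(teacher_negs, dim=-1)

    # Similarity to the positive sample: (batch, 1)
    pos_logit = (s * t_pos).sum(dim=-1, keepdim=True)
    # Similarity to each negative sample: (batch, n_neg)
    neg_logits = torch.einsum("bd,bnd->bn", s, t_neg)

    # The positive is placed at index 0, so the target class is 0 for every example.
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

In practice this contrastive term would be combined with the usual task (cross-entropy) and soft-label distillation losses; the paper applies it in both the pre-training and finetuning stages.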
