Paper Title
Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains
Paper Authors
Paper Abstract
Pre-trained language models have been applied to various NLP tasks with considerable performance gains. However, their large model sizes and long inference times limit the deployment of such models in real-time applications. One line of model compression approaches uses knowledge distillation to distill large teacher models into small student models. Most of these studies focus on a single domain only, ignoring transferable knowledge from other domains. We observe that a teacher trained on knowledge digested across domains achieves better generalization and thus provides better guidance for knowledge distillation. Hence we propose a Meta-Knowledge Distillation (Meta-KD) framework to build a meta-teacher model that captures transferable knowledge across domains and passes such knowledge to students. Specifically, we explicitly force the meta-teacher to capture transferable knowledge at both the instance level and the feature level from multiple domains, and then propose a meta-distillation algorithm to learn single-domain student models with guidance from the meta-teacher. Experiments on public multi-domain NLP tasks show the effectiveness and superiority of the proposed Meta-KD framework. Further, we also demonstrate the capability of Meta-KD in settings where training data is scarce.
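For context, the sketch below illustrates the basic teacher-to-student distillation objective that the abstract builds on: a student is trained on a weighted sum of the hard-label cross-entropy and a soft-label KL term against the teacher's logits. This is a minimal generic sketch, not the paper's Meta-KD algorithm; Meta-KD additionally learns the teacher across multiple domains and transfers instance-level and feature-level knowledge. The names TEMPERATURE, ALPHA, and distillation_loss are illustrative assumptions.

# Minimal knowledge-distillation sketch in PyTorch (illustrative, not the Meta-KD implementation).
import torch
import torch.nn.functional as F

TEMPERATURE = 2.0  # softens the teacher distribution (assumed value)
ALPHA = 0.5        # weight between hard-label and soft-label terms (assumed value)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=TEMPERATURE, alpha=ALPHA):
    """Combine cross-entropy on gold labels with KL divergence to the teacher."""
    # Hard-label term: standard supervised cross-entropy.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL between temperature-scaled distributions,
    # rescaled by T^2 as is conventional in distillation.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl

# Toy usage: a batch of 4 examples with 3 classes.
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = torch.randn(4, 3)          # e.g., produced by a (meta-)teacher model
labels = torch.tensor([0, 2, 1, 0])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()

In the Meta-KD setting described above, the teacher supplying teacher_logits would be the meta-teacher trained across domains, and the meta-distillation algorithm would further weight and supplement this objective per domain.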