Paper Title
LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning
Paper Authors
Paper Abstract
While pre-training and fine-tuning, e.g., BERT~\citep{devlin2018bert}, GPT-2~\citep{radford2019language}, have achieved great success in language understanding and generation tasks, the pre-trained models are usually too big for online deployment in terms of both memory cost and inference speed, which hinders them from practical online usage. In this paper, we propose LightPAFF, a Lightweight Pre-training And Fine-tuning Framework that leverages two-stage knowledge distillation to transfer knowledge from a big teacher model to a lightweight student model in both the pre-training and fine-tuning stages. In this way, the lightweight model can achieve accuracy similar to the big teacher model, but with far fewer parameters and thus faster online inference speed. LightPAFF can support different pre-training methods (such as BERT, GPT-2 and MASS~\citep{song2019mass}) and be applied to many downstream tasks. Experiments on three language understanding tasks, three language modeling tasks and three sequence-to-sequence generation tasks demonstrate that, while achieving accuracy similar to the big BERT, GPT-2 and MASS models, LightPAFF reduces the model size by nearly 5x and improves online inference speed by 5x-7x.
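The abstract describes transferring knowledge from a big teacher to a lightweight student via knowledge distillation in both stages. As a minimal sketch (not the paper's exact formulation), the per-stage training objective in such a framework is typically a weighted combination of the standard supervised loss and a soft-target term that matches the student's output distribution to the teacher's; the function name, `temperature`, and `alpha` below are illustrative assumptions, not reported settings.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic knowledge-distillation objective (illustrative sketch):
    weighted sum of hard-label cross-entropy and a KL term pushing the
    student's softened distribution toward the teacher's."""
    # Hard-label term: standard cross-entropy against ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 to keep gradient magnitudes comparable.
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    return alpha * kl + (1.0 - alpha) * ce
```

In a two-stage setup, a loss of this form would be applied once during pre-training (distilling from the big pre-trained teacher) and again during fine-tuning (distilling from the fine-tuned teacher on the downstream task), so the student benefits from the teacher's knowledge in both stages.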