Paper Title

One Student Knows All Experts Know: From Sparse to Dense

Authors

Fuzhao Xue, Xiaoxin He, Xiaozhe Ren, Yuxuan Lou, Yang You

Abstract

The human education system trains one student with multiple experts. Mixture-of-experts (MoE) is a powerful sparse architecture that includes multiple experts. However, sparse MoE models are prone to overfitting, hard to deploy, and not hardware-friendly for practitioners. In this work, inspired by the human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) as knowledgeable as one sparse MoE. We investigate this task by proposing a general training framework that includes knowledge gathering and knowledge distillation. Specifically, to gather key knowledge from different pre-trained experts, we first investigate four possible knowledge gathering methods, i.e., summation, averaging, Top-K Knowledge Gathering (Top-KG), and Singular Value Decomposition Knowledge Gathering (SVD-KG) proposed in this paper. We then refine the dense student model by knowledge distillation to offset the noise from gathering. On ImageNet, our OneS preserves $61.7\%$ of the benefits from MoE and achieves $78.4\%$ top-1 accuracy on ImageNet with only $15$M parameters. On four natural language processing datasets, OneS obtains $88.2\%$ of the MoE benefits and outperforms the best baseline by $51.7\%$ using the same architecture and training data. In addition, compared with its MoE counterpart, OneS can achieve a $3.7 \times$ inference speedup due to less computation and a hardware-friendly architecture.
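
To make the knowledge gathering step more concrete, here is a minimal NumPy sketch of merging several expert weight matrices into one dense matrix. Summation and averaging follow directly from their names. The low-rank variant is only one plausible reading of an SVD-based gathering step: the abstract does not describe how SVD-KG or Top-KG work, so `gather_low_rank`, its `rank` parameter, and the toy shapes are all assumptions made for illustration, not the paper's algorithm. Top-KG is left out because the abstract gives no detail to base a sketch on.

```python
import numpy as np


def gather_sum(experts):
    """Merge experts by element-wise summation of their weight matrices."""
    return np.sum(experts, axis=0)


def gather_mean(experts):
    """Merge experts by element-wise averaging of their weight matrices."""
    return np.mean(experts, axis=0)


def gather_low_rank(experts, rank):
    """Hypothetical SVD-style gathering (an assumption, not the paper's method):
    keep the top-`rank` singular components of each expert, then average."""
    approximations = []
    for weights in experts:
        u, s, vt = np.linalg.svd(weights, full_matrices=False)
        # Scale the kept left singular vectors by their singular values,
        # then project back with the kept right singular vectors.
        approximations.append((u[:, :rank] * s[:rank]) @ vt[:rank, :])
    return np.mean(approximations, axis=0)


# Toy setup: 4 experts, each an 8 x 16 weight matrix (shapes are illustrative).
rng = np.random.default_rng(0)
experts = rng.standard_normal((4, 8, 16))

dense_sum = gather_sum(experts)
dense_mean = gather_mean(experts)
dense_low_rank = gather_low_rank(experts, rank=2)
```

In each case the merged matrix has the shape of a single expert, so a dense student layer initialized from it costs the same as one expert at inference time. The abstract notes that gathering introduces noise, which the subsequent knowledge distillation stage is meant to offset.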
