Paper Title

Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher

Paper Authors

Guangda Ji, Zhanxing Zhu

Paper Abstract

Knowledge distillation is a strategy for training a student network under the guidance of the soft outputs of a teacher network. It has been a successful method for model compression and knowledge transfer. However, knowledge distillation currently lacks a convincing theoretical understanding. On the other hand, recent findings on the neural tangent kernel enable us to approximate a wide neural network with a linear model over the network's random features. In this paper, we theoretically analyze knowledge distillation for wide neural networks. First, we provide a transfer risk bound for the linearized model of the network. Then, we propose a metric of the task's training difficulty, called data inefficiency. Based on this metric, we show that for a perfect teacher, a high ratio of the teacher's soft labels can be beneficial. Finally, for the case of an imperfect teacher, we find that hard labels can correct the teacher's wrong predictions, which explains the practice of mixing hard and soft labels.
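For context, the "linearized model" referred to above is the standard first-order expansion of a wide network around its initialization, as in the neural tangent kernel literature; the notation below is a generic reconstruction, not copied from the paper:

$$ f_\theta(x) \;\approx\; f_{\theta_0}(x) + \nabla_\theta f_{\theta_0}(x)^\top (\theta - \theta_0), $$

where the gradient features $\nabla_\theta f_{\theta_0}(x)$ play the role of the network's random features and induce the kernel $\Theta(x, x') = \nabla_\theta f_{\theta_0}(x)^\top \nabla_\theta f_{\theta_0}(x')$.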
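To make the "mixing hard and soft labels" practice concrete, here is a minimal PyTorch-style sketch of a distillation objective with a soft-label ratio. The function name `distillation_loss`, the `soft_ratio` weighting, and the temperature value are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      soft_ratio=0.9, temperature=4.0):
    """Convex combination of a KL term on the teacher's softened outputs
    (the "soft labels") and ordinary cross-entropy on ground-truth labels."""
    # Soft part: push the student toward the teacher's temperature-softened distribution.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(student_log_probs, soft_targets,
                         reduction="batchmean") * temperature ** 2

    # Hard part: cross-entropy on the true labels, which is what can
    # correct an imperfect teacher's wrong predictions.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return soft_ratio * soft_loss + (1.0 - soft_ratio) * hard_loss

# Toy usage: a batch of 8 examples with 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
hard_labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, hard_labels)
```

In this sketch, setting `soft_ratio` close to 1 corresponds to the high-soft-label regime the abstract describes as beneficial under a perfect teacher, while a nonzero hard-label weight is what lets ground-truth labels correct an imperfect teacher.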
