Title
Self-Distillation Amplifies Regularization in Hilbert Space
Authors
Abstract
Knowledge distillation, introduced in the deep learning context, is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed in predictions of the trained model as new target values for retraining (and iterate this loop possibly a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held-out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and solely evolves by looping over training. To the best of our knowledge, there is no rigorous understanding of this phenomenon. This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is a Hilbert space and fitting is subject to $\ell_2$ regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.
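The self-distillation loop described in the abstract can be illustrated in the concrete setting the paper analyzes: $\ell_2$-regularized fitting in a reproducing-kernel Hilbert space, i.e. kernel ridge regression, where each round refits the model to the previous round's predictions on the training inputs. The sketch below is illustrative only (the kernel choice, `lam`, and `gamma` are assumptions, not values from the paper):

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Gram matrix of the RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def self_distill(X, y, rounds=3, lam=0.1, gamma=1.0):
    """Kernel ridge regression with self-distillation: each round
    solves the ell_2-regularized fit against the previous round's
    predictions on the same training inputs (no new information)."""
    n = len(X)
    K = rbf_kernel(X, X, gamma)
    targets = y.copy()
    models = []  # dual coefficients alpha from each round
    for _ in range(rounds):
        # closed-form ridge solution: alpha = (K + lam * n * I)^{-1} targets
        alpha = np.linalg.solve(K + lam * n * np.eye(n), targets)
        models.append(alpha)
        # predictions on the training points become the next targets
        targets = K @ alpha
    return models

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=30)
models = self_distill(X, y, rounds=5)
```

In the eigenbasis of the Gram matrix $K$, round $t$ scales each component of the solution by $k_i/(k_i + \lambda n) < 1$, so directions with small eigenvalues are damped fastest — a concrete instance of the progressive restriction of usable basis functions that the abstract refers to, and why too many rounds eventually under-fit.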