Paper Title
Densely Guided Knowledge Distillation using Multiple Teacher Assistants
Paper Authors
Paper Abstract
With the success of deep neural networks, knowledge distillation, which guides the learning of a small student network from a large teacher network, is being actively studied for model compression and transfer learning. However, few studies have addressed the poor learning of the student network when the student and teacher model sizes differ significantly. In this paper, we propose densely guided knowledge distillation using multiple teacher assistants that gradually decrease in model size to efficiently bridge the large gap between the teacher and student networks. To stimulate more efficient learning of the student network, we guide each teacher assistant with every larger teacher assistant iteratively. Specifically, when teaching a smaller teacher assistant at the next step, the existing larger teacher assistants from the previous step are used together with the teacher network. Moreover, we design stochastic teaching, in which, for each mini-batch, the teacher or teacher assistants are randomly dropped. This acts as a regularizer that improves the efficiency of teaching the student network. Thus, the student can always learn salient distilled knowledge from multiple sources. We verified the effectiveness of the proposed method for a classification task using CIFAR-10, CIFAR-100, and ImageNet. We also achieved significant performance improvements with various backbone architectures such as ResNet, WideResNet, and VGG.
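As a concrete illustration of the training objective described in the abstract, below is a minimal PyTorch-style sketch of a densely guided distillation loss with stochastic teaching: the student (or a smaller teacher assistant) is supervised by the ground-truth labels plus the softened outputs of the teacher and all larger teacher assistants, and each guide may be randomly dropped per mini-batch. Names such as dgkd_loss, guide_logits_list, temperature, alpha, and drop_prob are illustrative assumptions, not the authors' reference implementation.

```python
# A minimal sketch of a densely guided distillation loss with stochastic
# teaching, assuming a standard PyTorch classification setup. All names and
# hyperparameter values below are illustrative assumptions.
import random
import torch
import torch.nn.functional as F

def dgkd_loss(student_logits, guide_logits_list, labels,
              temperature=4.0, alpha=0.5, drop_prob=0.5):
    """Combine a cross-entropy term with KD terms from the teacher and all
    larger teacher assistants; each guide may be stochastically dropped
    for the current mini-batch ("stochastic teaching")."""
    # Hard-label loss on the ground truth.
    ce = F.cross_entropy(student_logits, labels)

    # Randomly keep a subset of guides for this mini-batch (keep at least one).
    kept = [g for g in guide_logits_list if random.random() > drop_prob]
    if not kept:
        kept = [random.choice(guide_logits_list)]

    # Soft-label KD loss (KL divergence at temperature T), averaged over guides.
    kd = 0.0
    for g in kept:
        kd = kd + F.kl_div(
            F.log_softmax(student_logits / temperature, dim=1),
            F.softmax(g.detach() / temperature, dim=1),
            reduction="batchmean",
        ) * (temperature ** 2)
    kd = kd / len(kept)

    return (1.0 - alpha) * ce + alpha * kd
```

In this sketch, guide_logits_list would hold the logits of the teacher and every larger teacher assistant for the current mini-batch, so each smaller model in the chain is trained against all of its larger predecessors rather than only its immediate one.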