Paper Title

Differentiable Feature Aggregation Search for Knowledge Distillation

Paper Authors

Yushuo Guan, Pengyu Zhao, Bingxuan Wang, Yuanxing Zhang, Cong Yao, Kaigui Bian, Jian Tang

Abstract

Knowledge distillation has become increasingly important in model compression. It boosts the performance of a miniaturized student network with supervision from the output distribution and feature maps of a sophisticated teacher network. Some recent works introduce multi-teacher distillation to provide more supervision to the student network. However, the effectiveness of multi-teacher distillation methods is accompanied by costly computation resources. To address both the efficiency and the effectiveness of knowledge distillation, we introduce feature aggregation to imitate multi-teacher distillation within a single-teacher distillation framework by extracting informative supervision from multiple teacher feature maps. Specifically, we introduce DFA, a two-stage Differentiable Feature Aggregation search method, motivated by DARTS in neural architecture search, to efficiently find the aggregations. In the first stage, DFA formulates the searching problem as a bi-level optimization and leverages a novel bridge loss, which consists of a student-to-teacher path and a teacher-to-student path, to find appropriate feature aggregations. The two paths act as two players against each other, trying to optimize the unified architecture parameters in opposite directions while simultaneously guaranteeing both the expressivity and the learnability of the feature aggregation. In the second stage, DFA performs knowledge distillation with the derived feature aggregation. Experimental results show that DFA outperforms existing methods on the CIFAR-100 and CINIC-10 datasets under various teacher-student settings, verifying the effectiveness and robustness of the design.
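
For readers unfamiliar with DARTS-style search, the sketch below illustrates the general idea of a differentiable feature aggregation and the alternating bi-level update described in the abstract: candidate teacher feature maps are combined through softmax-weighted architecture parameters, and those parameters are updated on held-out data while the ordinary weights are updated on training data. This is only an illustrative approximation under assumed names and shapes; `FeatureAggregation`, `distill_loss`, and `search_step` are hypothetical placeholders, and DFA's actual search space and bridge loss follow the paper.

```python
# Illustrative DARTS-style feature aggregation sketch (not DFA's exact method).
import torch
import torch.nn.functional as F

class FeatureAggregation(torch.nn.Module):
    """Softmax-weighted combination of candidate teacher feature maps."""
    def __init__(self, num_candidates: int, channels: int):
        super().__init__()
        # one architecture parameter (alpha) per candidate teacher feature map
        self.alpha = torch.nn.Parameter(torch.zeros(num_candidates))
        # 1x1 convolution projecting the aggregated map to the student's width
        self.proj = torch.nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, teacher_feats):
        # teacher_feats: list of tensors with identical shape [B, C, H, W]
        weights = F.softmax(self.alpha, dim=0)
        aggregated = sum(w * f for w, f in zip(weights, teacher_feats))
        return self.proj(aggregated)

def search_step(aggregation, student, distill_loss,
                train_batch, val_batch,
                w_optimizer, alpha_optimizer):
    """One alternating bi-level update in the spirit of DARTS.

    `student` and `distill_loss` are hypothetical stand-ins for the student
    network and the distillation objective (DFA's bridge loss in the paper).
    """
    # Upper level: update architecture parameters (alpha) on validation data.
    alpha_optimizer.zero_grad()
    distill_loss(student, aggregation, val_batch).backward()
    alpha_optimizer.step()

    # Lower level: update student/aggregation weights on training data.
    w_optimizer.zero_grad()
    distill_loss(student, aggregation, train_batch).backward()
    w_optimizer.step()
```

In practice, `alpha_optimizer` would be constructed over `aggregation.alpha` only and `w_optimizer` over the remaining weights, mirroring the two-player structure in which the architecture parameters and the network weights are optimized on different data splits.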
