Paper Title
Wasserstein Contrastive Representation Distillation
Paper Authors
Paper Abstract
The primary goal of knowledge distillation (KD) is to encapsulate the information of a model learned by a teacher network into a student network, with the latter being more compact than the former. Existing work, e.g., using the Kullback-Leibler divergence for distillation, may fail to capture important structural knowledge in the teacher network and often lacks the ability for feature generalization, particularly when the teacher and student are built to address different classification tasks. We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both the primal and dual forms of the Wasserstein distance for KD. The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes a lower bound on the mutual information between the teacher and student networks. The primal form is used for local contrastive knowledge transfer within a mini-batch, effectively matching the feature distributions of the teacher and student networks. Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression, and cross-modal transfer.
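For readers who want a concrete picture of the two terms described in the abstract, the sketch below pairs an InfoNCE-style contrastive critic (a simplified stand-in for the dual-form, global transfer objective) with an entropic optimal-transport (Sinkhorn) matching of teacher and student features within a mini-batch (a stand-in for the primal-form, local transfer objective). The function names, hyper-parameters, and the 0.1 loss weight are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, NOT the authors' implementation: an InfoNCE-style critic stands in
# for the dual-form (global) contrastive objective, and entropic optimal transport
# within a mini-batch stands in for the primal-form (local) objective.
# All hyper-parameters (temperature, eps, iteration count, loss weight) are assumed.
import torch
import torch.nn.functional as F


def contrastive_global_loss(f_s, f_t, temperature=0.1):
    """InfoNCE-style loss: each student feature is pulled toward its own teacher
    feature (positive) and pushed away from the other teacher features in the
    batch (negatives); its negative is a lower bound on mutual information."""
    f_s = F.normalize(f_s, dim=1)           # (B, D) student features
    f_t = F.normalize(f_t, dim=1)           # (B, D) teacher features
    logits = f_s @ f_t.t() / temperature    # (B, B) similarity matrix
    labels = torch.arange(f_s.size(0), device=f_s.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)


def sinkhorn_local_loss(f_s, f_t, eps=0.05, n_iters=50):
    """Entropic-regularized optimal transport (Sinkhorn iterations) between the
    student and teacher feature sets of one mini-batch, approximating the primal
    form of the Wasserstein distance."""
    f_s = F.normalize(f_s, dim=1)
    f_t = F.normalize(f_t, dim=1)
    cost = torch.cdist(f_s, f_t, p=2) ** 2   # (B, B) pairwise squared distances
    cost = cost / (cost.max() + 1e-8)        # rescale for numerical stability
    B = cost.size(0)
    mu = torch.full((B,), 1.0 / B, device=cost.device)  # uniform marginals
    nu = torch.full((B,), 1.0 / B, device=cost.device)
    K = torch.exp(-cost / eps)               # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                 # Sinkhorn fixed-point updates
        v = nu / (K.t() @ u + 1e-8)
        u = mu / (K @ v + 1e-8)
    transport = u.unsqueeze(1) * K * v.unsqueeze(0)  # approximate transport plan
    return (transport * cost).sum()          # transport cost = Wasserstein estimate


if __name__ == "__main__":
    # Toy usage with random tensors in place of real teacher/student activations.
    f_student = torch.randn(64, 128)
    f_teacher = torch.randn(64, 128)
    loss = contrastive_global_loss(f_student, f_teacher) \
        + 0.1 * sinkhorn_local_loss(f_student, f_teacher)  # 0.1 is an assumed weight
    print(loss.item())
```

In an actual distillation setup these two terms would be computed on features from chosen layers of the teacher and student and added, with suitable weights, to the usual task loss; the paper's own formulation and weighting should be consulted for the exact objective.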