Paper Title
Slimmable Networks for Contrastive Self-supervised Learning
Paper Authors
Paper Abstract
Self-supervised learning has made significant progress in pre-training large models but struggles with small ones. Mainstream solutions to this problem rely on knowledge distillation, which involves a two-stage procedure: first training a large teacher model and then distilling it to improve the generalization ability of smaller models. In this work, we introduce an alternative one-stage solution that obtains pre-trained small models without extra teachers, namely slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks, and can be pre-trained once to obtain various networks, including small ones with low computation costs. However, interference between the weight-sharing networks leads to severe performance degradation in the self-supervised setting, as evidenced by gradient magnitude imbalance and gradient direction divergence. The former indicates that a small proportion of parameters produce dominant gradients during backpropagation, while the main parameters may not be fully optimized. The latter shows that the gradient direction is disordered and the optimization process is unstable. To address these issues, we introduce three techniques to make the main parameters produce dominant gradients and to keep sub-network outputs consistent: slow start training of sub-networks, online distillation, and loss re-weighting according to model sizes. Furthermore, theoretical results demonstrate that a single slimmable linear layer is sub-optimal during linear evaluation, so a switchable linear probe layer is applied instead. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than previous methods with fewer parameters and FLOPs. The code is at https://github.com/mzhaoshuai/SlimCLR.
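The following is a minimal, self-contained PyTorch sketch of the training ideas summarized in the abstract: a width-switchable (slimmable) encoder whose sub-networks share the leading channels of each layer, slow start training of the sub-networks, online distillation from the full network, and loss re-weighting by model size. All names here (SlimmableMLP, info_nce, training_step) and the concrete loss forms (MSE-based distillation, width-proportional weights) are illustrative assumptions for exposition, not the implementation from the official repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlimmableMLP(nn.Module):
    """Toy slimmable encoder: sub-networks reuse the leading hidden channels."""

    def __init__(self, in_dim=128, hidden=256, feat_dim=64):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(hidden, in_dim) * 0.02)
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(torch.randn(feat_dim, hidden) * 0.02)
        self.b2 = nn.Parameter(torch.zeros(feat_dim))

    def forward(self, x, width=1.0):
        # Slice the first `width` fraction of hidden channels (weight sharing).
        h = max(1, int(self.w1.shape[0] * width))
        z = F.relu(F.linear(x, self.w1[:h], self.b1[:h]))
        return F.linear(z, self.w2[:, :h], self.b2)


def info_nce(q, k, tau=0.2):
    """Standard InfoNCE loss; positives are matching rows of the two views."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / tau
    labels = torch.arange(q.shape[0], device=q.device)
    return F.cross_entropy(logits, labels)


def training_step(model, x1, x2, epoch, slow_start_epochs=5,
                  widths=(1.0, 0.5, 0.25), tau=0.2):
    # Full network: plain contrastive loss between the two augmented views.
    q_full = model(x1, width=widths[0])
    k_full = model(x2, width=widths[0])
    loss = info_nce(q_full, k_full, tau)

    # Slow start: sub-networks join only after a warm-up period, so the main
    # (full-network) parameters establish dominant gradients first.
    if epoch >= slow_start_epochs:
        for w in widths[1:]:
            q_sub = model(x1, width=w)
            # Online distillation: pull the sub-network output toward the
            # full network's detached output to keep predictions consistent.
            distill = F.mse_loss(F.normalize(q_sub, dim=1),
                                 F.normalize(q_full, dim=1).detach())
            # Re-weight by model size: smaller sub-networks contribute less,
            # so their gradients do not overwhelm the shared parameters.
            loss = loss + w * (info_nce(q_sub, k_full.detach(), tau) + distill)
    return loss


# Usage with random tensors standing in for two augmented views of a batch.
model = SlimmableMLP()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x1, x2 = torch.randn(8, 128), torch.randn(8, 128)
loss = training_step(model, x1, x2, epoch=10)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

For linear evaluation, the switchable linear probe mentioned in the abstract would correspond to keeping a separate classifier head per width (e.g., one nn.Linear per width in an nn.ModuleList) instead of a single slimmable linear layer.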