Paper Title
Hard Negative Mixing for Contrastive Learning
Paper Authors
Paper Abstract
Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images apart, one can train highly transferable visual representations. As revealed by recent studies, heavy data augmentation and large sets of negatives are both crucial in learning such representations. At the same time, data mixing strategies, either at the image or the feature level, improve both supervised and semi-supervised learning by synthesizing novel examples, forcing networks to learn more robust features. In this paper, we argue that an important aspect of contrastive learning, i.e., the effect of hard negatives, has so far been neglected. To get more meaningful negative samples, current top contrastive self-supervised learning approaches either substantially increase the batch sizes or keep very large memory banks; increasing the memory size, however, leads to diminishing returns in terms of performance. We therefore start by delving deeper into a top-performing framework and show evidence that harder negatives are needed to facilitate better and faster learning. Based on these observations, and motivated by the success of data mixing, we propose hard negative mixing strategies at the feature level that can be computed on-the-fly with minimal computational overhead. We exhaustively ablate our approach on linear classification, object detection, and instance segmentation, and show that employing our hard negative mixing procedure improves the quality of visual representations learned by a state-of-the-art self-supervised learning method.
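To make the feature-level mixing idea concrete, the following is a minimal sketch of how synthetic hard negatives could be generated on-the-fly, assuming an L2-normalized query embedding and a memory bank of negative embeddings. This is an illustration under stated assumptions, not the paper's exact implementation: the function name `mix_hard_negatives`, the hyperparameters `n_hard` and `n_synth`, and the tensor shapes are all hypothetical.

```python
# A minimal, illustrative sketch of feature-level hard negative mixing.
# All names and hyperparameters here are assumptions for illustration,
# not the paper's exact implementation.
import torch
import torch.nn.functional as F

def mix_hard_negatives(query, queue, n_hard=64, n_synth=32):
    """Synthesize extra negatives by convexly mixing the hardest ones.

    query: (D,) L2-normalized query embedding.
    queue: (K, D) L2-normalized negative embeddings (memory bank).
    Returns (n_synth, D) synthetic, L2-normalized negatives.
    """
    # Rank negatives by similarity to the query; the most similar
    # (hardest) ones contribute most to the contrastive loss.
    sims = queue @ query                      # (K,) cosine similarities
    hard_idx = sims.topk(n_hard).indices      # indices of hardest negatives
    hard = queue[hard_idx]                    # (n_hard, D)

    # Mix random pairs of hard negatives with random convex coefficients.
    i = torch.randint(0, n_hard, (n_synth,))
    j = torch.randint(0, n_hard, (n_synth,))
    alpha = torch.rand(n_synth, 1)
    synth = alpha * hard[i] + (1 - alpha) * hard[j]

    # Re-normalize so synthetic negatives live on the unit hypersphere,
    # like the real embeddings.
    return F.normalize(synth, dim=1)

# Usage: append the synthetic negatives to the real ones before
# computing the contrastive (InfoNCE) logits.
q = F.normalize(torch.randn(128), dim=0)
queue = F.normalize(torch.randn(4096, 128), dim=1)
negatives = torch.cat([queue, mix_hard_negatives(q, queue)], dim=0)
```

The re-normalization step matters: contrastive losses of this kind compare cosine similarities, so mixed features must be projected back onto the unit hypersphere before being used as negatives.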