Paper Title
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Paper Authors
Paper Abstract
This paper presents a simple yet effective framework, MaskCLIP, which incorporates a newly proposed masked self-distillation objective into contrastive language-image pretraining. The core idea of masked self-distillation is to distill the representation of a full image into the representation predicted from a masked image. This incorporation enjoys two vital benefits. First, masked self-distillation targets local patch representation learning, which is complementary to the vision-language contrastive objective, which focuses on text-related representations. Second, masked self-distillation is also consistent with the vision-language contrastive objective from the perspective of the training target, as both utilize the visual encoder for feature alignment; it is thus able to learn local semantics with indirect supervision from the language. We provide specially designed experiments with a comprehensive analysis to validate these two benefits. Symmetrically, we also introduce local semantic supervision into the text branch, which further improves the pretraining performance. With extensive experiments, we show that MaskCLIP, when applied to various challenging downstream tasks, achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder. Code will be released at \url{https://github.com/LightDXY/MaskCLIP}.
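The masked self-distillation idea described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration, not the authors' implementation: it assumes a toy linear "encoder" shared by teacher and student (the paper uses a ViT, with the teacher typically an EMA copy of the student and a prediction head on the student side), zeroed-out patches standing in for mask tokens, and a cosine distillation loss computed only on the masked positions.

```python
import numpy as np

# Toy visual encoder: a fixed linear projection applied per patch.
# (Hypothetical stand-in for the ViT encoder used in the paper.)
PROJ = np.full((16, 8), 0.1)

def encode(patches):
    return patches @ PROJ

def masked_self_distillation_loss(patches, mask_ratio=0.5, seed=0):
    """Distill full-image patch features (teacher) into features
    predicted from a masked view of the same image (student).

    patches: (num_patches, patch_dim) array for one image.
    Returns a scalar cosine-distance loss over masked positions.
    """
    rng = np.random.default_rng(seed)
    num_patches = patches.shape[0]

    # Teacher branch: encode the full, unmasked image.
    target = encode(patches)

    # Student branch: randomly mask patches, then encode the masked view.
    mask = rng.random(num_patches) < mask_ratio
    masked_view = patches.copy()
    masked_view[mask] = 0.0  # zeros stand in for a learned mask token
    pred = encode(masked_view)

    # Cosine distillation loss, computed only where patches were masked.
    t = target[mask]
    p = pred[mask]
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + 1e-8)
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(t * p, axis=1)))

patches = np.arange(64, dtype=float).reshape(4, 16) / 64.0
loss = masked_self_distillation_loss(patches)
```

In MaskCLIP this loss is optimized jointly with the image-text contrastive loss, so the shared visual encoder receives both local (patch-level) and global (text-aligned) supervision.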