Paper Title
Robust Cross-Modal Representation Learning with Progressive Self-Distillation
Paper Authors
Paper Abstract
The learning objective of the vision-language approach of CLIP does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To address this challenge, we introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data. Our model distills its own knowledge to dynamically generate soft-alignment targets for a subset of images and captions in every minibatch, which are then used to update its parameters. Extensive evaluation across 14 benchmark datasets shows that our method consistently outperforms its CLIP counterpart in multiple settings, including: (a) zero-shot classification, (b) linear probe transfer, and (c) image-text retrieval, without incurring added computational cost. Analysis using an ImageNet-based robustness test-bed reveals that our method offers better effective robustness to natural distribution shifts compared to both ImageNet-trained models and CLIP itself. Lastly, pretraining with datasets spanning two orders of magnitude in size shows that our improvements over CLIP tend to scale with the number of training examples.
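To make the idea in the abstract concrete, the sketch below shows a CLIP-style contrastive loss in which part of each minibatch is supervised by soft alignment targets taken from the model's own detached predictions instead of the usual one-hot targets. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the function name `soft_alignment_clip_loss`, the `distill_frac` parameter, and the take-the-first-k subset selection are illustrative choices.

```python
# Minimal sketch (assumed PyTorch): CLIP-style contrastive loss with
# self-distilled soft alignment targets for part of the minibatch.
import torch
import torch.nn.functional as F

def soft_alignment_clip_loss(image_emb, text_emb, temperature=0.07, distill_frac=0.5):
    # L2-normalize so pairwise dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity logits; row i compares image i to all captions.
    logits = image_emb @ text_emb.t() / temperature
    batch_size = logits.size(0)

    # Hard one-hot targets: image i is paired with caption i.
    hard_targets = torch.eye(batch_size, device=logits.device)

    # Soft targets distilled from the model's own (detached) predictions.
    with torch.no_grad():
        soft_i2t = logits.softmax(dim=-1)      # image-to-text teacher distribution
        soft_t2i = logits.t().softmax(dim=-1)  # text-to-image teacher distribution

    # Supervise a subset of the minibatch with soft targets (illustrative:
    # here simply the first `num_distilled` examples).
    num_distilled = int(distill_frac * batch_size)
    mask = torch.zeros(batch_size, 1, device=logits.device)
    mask[:num_distilled] = 1.0
    targets_i2t = mask * soft_i2t + (1.0 - mask) * hard_targets
    targets_t2i = mask * soft_t2i + (1.0 - mask) * hard_targets

    # Symmetric soft cross-entropy over both retrieval directions.
    loss_i2t = -(targets_i2t * logits.log_softmax(dim=-1)).sum(dim=-1).mean()
    loss_t2i = -(targets_t2i * logits.t().log_softmax(dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In a progressive scheme of the kind the abstract describes, `distill_frac` would presumably be increased over the course of training as the model becomes a more reliable teacher of its own alignment targets; the constant value here is only for illustration.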