Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation

Paper Title

Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation

Authors

Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu

Abstract

Weakly supervised phrase grounding aims at learning region-phrase correspondences using only image-sentence pairs. A major challenge thus lies in the missing links between image regions and sentence phrases during training. To address this challenge, we leverage a generic object detector at training time, and propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching. Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed. Importantly, our region-phrase score function is learned by distilling from soft matching scores between the detected object names and candidate phrases within an image-sentence pair, while the image-sentence score function is supervised by ground-truth image-sentence pairs. The design of such score functions removes the need for object detection at test time, thereby significantly reducing the inference cost. Without bells and whistles, our approach achieves state-of-the-art results on visual phrase grounding, surpassing previous methods that require expensive object detectors at test time.
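The abstract names two score functions and two sources of supervision but gives no formulas. The following is a minimal sketch of how those pieces could fit together, not the authors' implementation. Everything specific in it is an assumption made for illustration: cosine similarity as the region-phrase score, max-over-regions then mean-over-phrases pooling for the image-sentence score, a softmax cross-entropy distillation loss, an InfoNCE-style contrastive loss, the temperature values, and all function names.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-8)

def softmax(xs, temperature=1.0):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def region_phrase_scores(regions, phrases):
    """Student scores: one row per phrase, one column per region.
    Cosine similarity is an assumed choice of score function."""
    return [[cosine(p, r) for r in regions] for p in phrases]

def image_sentence_score(scores):
    """Image-sentence score built on region-phrase scores:
    best region per phrase, averaged over phrases (assumed pooling)."""
    return sum(max(row) for row in scores) / len(scores)

def distillation_loss(student_scores, teacher_scores, temperature=0.5):
    """Cross-entropy between teacher and student distributions over regions.
    teacher_scores[p][r] stands for the soft matching score between phrase p
    and the detected object name of region r (available at training only)."""
    loss = 0.0
    for s_row, t_row in zip(student_scores, teacher_scores):
        target = softmax(t_row, temperature)
        pred = softmax(s_row, temperature)
        loss -= sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))
    return loss / len(student_scores)

def contrastive_loss(pair_scores, temperature=0.5):
    """InfoNCE-style loss over a batch: pair_scores[i][j] scores image i
    against sentence j, with ground-truth pairs on the diagonal."""
    loss = 0.0
    for i, row in enumerate(pair_scores):
        loss -= math.log(softmax(row, temperature)[i] + 1e-12)
    return loss / len(pair_scores)

# Toy example: two region features, two phrase features (hypothetical values).
regions = [[1.0, 0.0], [0.0, 1.0]]
phrases = [[0.9, 0.1], [0.2, 0.8]]
student = region_phrase_scores(regions, phrases)
teacher = [[0.8, 0.1], [0.2, 0.7]]  # hypothetical object-name/phrase matches
total_loss = distillation_loss(student, teacher) + contrastive_loss(
    [[image_sentence_score(student), 0.1], [0.1, 0.9]]
)
```

In this sketch the teacher scores enter only through the training loss. Grounding a phrase at test time would use `region_phrase_scores` alone, which mirrors the abstract's claim that detected object names are not needed at inference.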
