Paper Title
Contrastive Visual-Linguistic Pretraining
Paper Authors
Paper Abstract
Several multi-modality representation learning approaches, such as LXMERT and ViLBERT, have been proposed recently. Such approaches achieve superior performance due to the high-level semantic information captured during large-scale multimodal pretraining. However, as ViLBERT and LXMERT adopt visual region regression and classification losses, they often suffer from domain gap and noisy label problems, since they rely on visual features pretrained on the Visual Genome dataset. To overcome these issues, we propose unbiased Contrastive Visual-Linguistic Pretraining (CVLP), which constructs a visual self-supervised loss built upon contrastive learning. We evaluate CVLP on several downstream tasks, including VQA, GQA and NLVR2, to validate the superiority of contrastive learning for multi-modality representation learning. Our code is available at: https://github.com/ArcherYunDong/CVLP-.
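The abstract's central idea is to replace visual region regression and classification losses with a contrastive visual self-supervised objective. The following minimal sketch illustrates a generic InfoNCE-style contrastive loss over visual features; the function names, tensor shapes, and temperature value are illustrative assumptions, not CVLP's exact formulation.

```python
# Minimal sketch of an InfoNCE-style contrastive loss over visual feature
# embeddings. Shapes, names, and the temperature are illustrative assumptions,
# not the paper's exact objective.
import torch
import torch.nn.functional as F


def contrastive_loss(query: torch.Tensor, key: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss: each query should match its own key (positive pair)
    and be pushed away from the other keys in the batch (negatives).

    query, key: (batch, dim) embeddings from two views or encoders.
    """
    q = F.normalize(query, dim=-1)
    k = F.normalize(key, dim=-1)
    logits = q @ k.t() / temperature                     # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    q = torch.randn(8, 256)   # e.g., region features from the online encoder
    k = torch.randn(8, 256)   # e.g., features from a momentum/target encoder
    print(contrastive_loss(q, k).item())
```

In this kind of setup, the self-supervised signal comes from matching two embeddings of the same visual input rather than regressing detector outputs, which is what lets the objective avoid the noisy-label issue the abstract describes.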