Paper Title
Normalized Contrastive Learning for Text-Video Retrieval
Paper Authors
Paper Abstract
Cross-modal contrastive learning has driven recent advances in multimodal retrieval with its simplicity and effectiveness. In this work, however, we reveal that cross-modal contrastive learning suffers from incorrect normalization of the sum retrieval probability of each text or video instance. Specifically, we show that many test instances are either over- or under-represented during retrieval, significantly hurting retrieval performance. To address this problem, we propose Normalized Contrastive Learning (NCL), which uses the Sinkhorn-Knopp algorithm to compute instance-wise biases that properly normalize the sum retrieval probability of each instance, so that every text and video instance is fairly represented during cross-modal retrieval. Empirical studies show that NCL brings consistent and significant gains in text-video retrieval across different model architectures, setting new state-of-the-art multimodal retrieval metrics on the ActivityNet, MSVD, and MSR-VTT datasets without any architecture engineering.
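To make the abstract's central idea concrete, below is a minimal NumPy sketch of how Sinkhorn-Knopp iterations can produce per-instance biases that equalize each instance's total retrieval probability. The function name `sinkhorn_biases`, the temperature `tau`, the iteration count, and the unit target marginals are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch only: Sinkhorn-Knopp in log space, under assumed unit marginals.
import numpy as np
from scipy.special import logsumexp

def sinkhorn_biases(sim, tau=0.05, n_iters=50):
    """Find text biases alpha and video biases beta such that
    P[i, j] = exp((sim[i, j] + alpha[i] + beta[j]) / tau) has
    (approximately) unit row and column sums, i.e. each text and each
    video gets the same total retrieval probability. Assumes a square
    batch so both marginal constraints are satisfiable."""
    log_k = sim / tau                  # log-kernel from similarity scores
    u = np.zeros(sim.shape[0])         # text-side potentials (log scale)
    v = np.zeros(sim.shape[1])         # video-side potentials (log scale)
    for _ in range(n_iters):
        # Row step: rescale so each text's retrieval probabilities sum to 1.
        u = -logsumexp(log_k + v[None, :], axis=1)
        # Column step: rescale so each video's probabilities sum to 1.
        v = -logsumexp(log_k + u[:, None], axis=0)
    return u * tau, v * tau            # biases on the similarity scale

# Usage: biases for a random 4x4 text-video similarity matrix.
sim = np.random.default_rng(0).normal(size=(4, 4))
alpha, beta = sinkhorn_biases(sim)
P = np.exp((sim + alpha[:, None] + beta[None, :]) / 0.05)
print(P.sum(axis=1), P.sum(axis=0))    # both close to 1 after convergence
```

Alternating the row and column updates is the standard Sinkhorn-Knopp scheme; working in log space with logsumexp keeps the iterations numerically stable at small temperatures.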