Paper Title
Support-set bottlenecks for video-text representation learning
Paper Authors
Paper Abstract
The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples that are semantically related -- for example, visually similar videos or ones that share the same depicted action. In this paper, we propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together: each sample's caption must be reconstructed as a weighted combination of other support samples' visual representations. This simple idea ensures that representations are not overly specialized to individual samples, are reusable across the dataset, and, unlike noise contrastive learning, explicitly encode semantics shared between samples. Our proposed method outperforms others by a large margin on MSR-VTT, VATEX, ActivityNet, and MSVD for video-to-text and text-to-video retrieval.
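To make the central idea concrete, the following is a minimal sketch of a support-set bottleneck as described in the abstract: a sample's caption must be reconstructed from a weighted combination of the visual embeddings of a support set, rather than from its own video embedding alone. This is an illustrative assumption about one possible instantiation, not the authors' actual architecture; the class, function, and parameter names (SupportSetBottleneck, query_emb, support_video_embs, the bag-of-words decoder, and the temperature) are all hypothetical stand-ins.

# Illustrative sketch (assumed, not the paper's implementation): reconstruct
# each caption from a softmax-weighted combination of support-set visual
# embeddings, so no single video embedding can encode its caption on its own.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupportSetBottleneck(nn.Module):
    def __init__(self, dim=256, vocab_size=1000, temperature=0.1):
        super().__init__()
        self.temperature = temperature
        # Toy "caption decoder": predicts a bag-of-words distribution from the
        # pooled support representation (a stand-in for a real text decoder).
        self.decoder = nn.Linear(dim, vocab_size)

    def forward(self, query_emb, support_video_embs, caption_bow):
        # Attention weights of each query over the support videos.
        sims = query_emb @ support_video_embs.t() / self.temperature   # (B, S)
        weights = F.softmax(sims, dim=-1)                              # (B, S)
        # Weighted combination of support visual representations.
        pooled = weights @ support_video_embs                          # (B, D)
        # Reconstruction objective: the caption must be predicted from the
        # pooled support features, not from the sample's own embedding.
        logits = self.decoder(pooled)                                  # (B, V)
        return F.binary_cross_entropy_with_logits(logits, caption_bow)

# Toy usage: 4 queries, a support set of 16 videos, 256-d embeddings.
model = SupportSetBottleneck()
query = F.normalize(torch.randn(4, 256), dim=-1)
support = F.normalize(torch.randn(16, 256), dim=-1)
captions = torch.randint(0, 2, (4, 1000)).float()  # toy bag-of-words targets
loss = model(query, support, captions)
print(loss.item())

Because the reconstruction target can only be reached through the shared support set, semantically related samples are encouraged to reuse the same support videos, which is the behaviour the abstract contrasts with the strict repulsion of noise contrastive learning.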