Title

Learning Audio-Video Modalities from Image Captions

Authors

Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

Abstract

A major challenge in text-video and text-audio retrieval is the lack of large-scale training data. This is unlike image captioning, where datasets are on the order of millions of samples. To close this gap, we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. Using this pipeline, we create a new large-scale, weakly labelled audio-video captioning dataset consisting of millions of paired clips and captions. We show that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips. We also show that our mined clips are suitable for text-audio pretraining, and achieve state-of-the-art results for the task of audio retrieval.
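
The core of the pipeline, as described in the abstract, is transferring a caption to any video clip whose frames are visually similar to the image the caption originally described. A minimal sketch of that matching step in Python, assuming a pretrained image encoder and a fixed similarity cutoff (`embed`, `mine_clips`, and `SIM_THRESHOLD` are illustrative names, not from the paper):

```python
# Hypothetical sketch of caption transfer by visual similarity.
# A real pipeline would use a pretrained image encoder; here `embed`
# is a stand-in that just flattens and L2-normalises pixels so that
# a dot product equals cosine similarity.
import numpy as np

SIM_THRESHOLD = 0.8  # assumed cutoff for accepting a caption transfer


def embed(images: np.ndarray) -> np.ndarray:
    """Placeholder encoder: (N, H, W, C) images -> unit-norm (N, d) features."""
    feats = images.reshape(len(images), -1).astype(np.float32)
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)


def mine_clips(seed_images, captions, video_frames, clip_ids):
    """Assign each caption to clips containing a visually matching frame."""
    seed_emb = embed(seed_images)    # (num_captioned_images, d)
    frame_emb = embed(video_frames)  # (num_video_frames, d)
    sims = frame_emb @ seed_emb.T    # cosine similarities, frames x images
    pairs = []
    for f, c in zip(*np.nonzero(sims >= SIM_THRESHOLD)):
        # Weak label: the clip around frame f inherits caption c,
        # with no manual annotation involved.
        pairs.append((clip_ids[f], captions[c]))
    return pairs
```

Because matches are accepted purely on embedding similarity, the resulting clip-caption pairs are weakly labelled rather than human-verified, which is why the abstract describes the dataset as weakly labelled.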
