Paper Title
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Paper Authors
Paper Abstract
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
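To make the training objective more concrete, below is a minimal sketch (in PyTorch) of the masked-snippet selection described above: the joint encoder's output at each MASK position is matched against a pool of candidate text/audio snippet encodings, and the model is trained with a contrastive cross-entropy to pick the correct one. The function name, tensor shapes, temperature, and the random stand-in encodings are illustrative assumptions, not the authors' actual implementation.

    # Illustrative sketch only: a contrastive "choose the correct masked-out
    # snippet" loss, assuming the joint encoder already produced an embedding
    # at each MASK position and an encoding for each candidate snippet.
    import torch
    import torch.nn.functional as F

    def masked_snippet_loss(mask_embeddings: torch.Tensor,
                            candidate_embeddings: torch.Tensor,
                            targets: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
        """Contrastive loss for selecting the correct masked-out snippet.

        mask_embeddings:      (B, D) encoder output at each MASK position.
        candidate_embeddings: (N, D) encodings of candidate text/audio snippets
                              (the true snippets plus in-batch distractors).
        targets:              (B,)   index of the correct candidate per mask.
        """
        # Cosine similarity between each mask prediction and every candidate.
        mask_embeddings = F.normalize(mask_embeddings, dim=-1)
        candidate_embeddings = F.normalize(candidate_embeddings, dim=-1)
        logits = mask_embeddings @ candidate_embeddings.T / temperature  # (B, N)
        # The model "learns by choosing the correct masked-out snippet".
        return F.cross_entropy(logits, targets)

    if __name__ == "__main__":
        torch.manual_seed(0)
        B, N, D = 4, 16, 32          # masks per batch, candidates, embedding dim
        preds = torch.randn(B, D)    # stand-in for joint-encoder outputs at MASKs
        cands = torch.randn(N, D)    # stand-in for snippet encodings
        gold = torch.arange(B)       # assume the first B candidates are correct
        print(masked_snippet_loss(preds, cands, gold))

As the abstract states, both text and audio snippets are masked and serve as prediction targets, which is what allows the ablations to measure how much audio pretraining contributes to each downstream task.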