Paper Title
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Paper Authors
Paper Abstract
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets state-of-the-art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600; outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
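To make the training objective more concrete, below is a minimal sketch (in PyTorch) of the masked-snippet selection described above: the joint encoder's output at each MASK position is matched against a pool of candidate text/audio snippet encodings, and the model is trained with a contrastive cross-entropy to pick the correct one. The function name, tensor shapes, temperature, and the random stand-in encodings are illustrative assumptions, not the authors' actual implementation.

    # Illustrative sketch only: a contrastive "choose the correct masked-out
    # snippet" loss, assuming the joint encoder already produced an embedding
    # at each MASK position and an encoding for each candidate snippet.
    import torch
    import torch.nn.functional as F

    def masked_snippet_loss(mask_embeddings: torch.Tensor,
                            candidate_embeddings: torch.Tensor,
                            targets: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
        """Contrastive loss for selecting the correct masked-out snippet.

        mask_embeddings:      (B, D) encoder output at each MASK position.
        candidate_embeddings: (N, D) encodings of candidate text/audio snippets
                              (the true snippets plus in-batch distractors).
        targets:              (B,)   index of the correct candidate per mask.
        """
        # Cosine similarity between each mask prediction and every candidate.
        mask_embeddings = F.normalize(mask_embeddings, dim=-1)
        candidate_embeddings = F.normalize(candidate_embeddings, dim=-1)
        logits = mask_embeddings @ candidate_embeddings.T / temperature  # (B, N)
        # The model "learns by choosing the correct masked-out snippet".
        return F.cross_entropy(logits, targets)

    if __name__ == "__main__":
        torch.manual_seed(0)
        B, N, D = 4, 16, 32          # masks per batch, candidates, embedding dim
        preds = torch.randn(B, D)    # stand-in for joint-encoder outputs at MASKs
        cands = torch.randn(N, D)    # stand-in for snippet encodings
        gold = torch.arange(B)       # assume the first B candidates are correct
        print(masked_snippet_loss(preds, cands, gold))

As the abstract states, both text and audio snippets are masked and serve as prediction targets, which is what allows the ablations to measure how much audio pretraining contributes to each downstream task.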