Paper Title
VindLU: A Recipe for Effective Video-and-Language Pretraining
Paper Authors
Paper Abstract
The last several years have witnessed remarkable progress in video-and-language (VidL) understanding. However, most modern VidL approaches use complex and specialized model architectures and sophisticated pretraining protocols, making the reproducibility, analysis, and comparison of these frameworks difficult. Hence, instead of proposing yet another new VidL model, this paper conducts a thorough empirical study demystifying the most important factors in VidL model design. Among the factors we investigate are (i) the spatiotemporal architecture design, (ii) the multimodal fusion schemes, (iii) the pretraining objectives, (iv) the choice of pretraining data, (v) the pretraining and finetuning protocols, and (vi) dataset and model scaling. Our empirical study reveals that the most important design factors include: temporal modeling, video-to-text multimodal fusion, masked modeling objectives, and joint training on images and videos. Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining. Our final model trained with this recipe achieves results comparable to or better than the state of the art on several VidL tasks without relying on external CLIP pretraining. In particular, on the text-to-video retrieval task, our approach obtains 61.2% on DiDeMo and 55.0% on ActivityNet, outperforming the current SOTA by 7.8% and 6.1%, respectively. Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC, and TVQA. Our code and pretrained models are publicly available at: https://github.com/klauscc/VindLU.
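To make two of the design factors highlighted in the abstract concrete (temporal modeling and video-to-text multimodal fusion), the following is a minimal, hypothetical PyTorch sketch. It is not the authors' implementation; all module names, dimensions, and hyperparameters are illustrative assumptions, and the actual recipe is documented in the repository linked above.

```python
# Hypothetical sketch (not the authors' code): illustrates (a) temporal modeling
# via self-attention over per-frame features and (b) video-to-text multimodal
# fusion via cross-attention. All names and sizes here are assumptions.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention across the time axis of per-frame patch features."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, patches, dim) -> attend over the time dimension only
        b, t, p, d = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * p, t, d)   # (batch*patches, time, dim)
        out, _ = self.attn(self.norm(x), self.norm(x), self.norm(x))
        x = x + out                                       # residual connection
        return x.reshape(b, p, t, d).permute(0, 2, 1, 3)  # back to (b, t, p, d)


class VideoToTextFusion(nn.Module):
    """Text tokens attend to video tokens (video-to-text cross-attention)."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # text: (batch, text_len, dim), video: (batch, video_tokens, dim)
        out, _ = self.cross_attn(self.norm_q(text),
                                 self.norm_kv(video),
                                 self.norm_kv(video))
        return text + out                                 # residual connection


if __name__ == "__main__":
    frames = torch.randn(2, 4, 196, 768)   # (batch, frames, patches, dim)
    text = torch.randn(2, 32, 768)         # (batch, text tokens, dim)
    video_feats = TemporalAttention()(frames).flatten(1, 2)  # (2, 784, 768)
    fused = VideoToTextFusion()(text, video_feats)
    print(fused.shape)                     # torch.Size([2, 32, 768])
```

The residual connections in this sketch reflect a common design choice for such blocks: they can be inserted into a pretrained image or text backbone without disrupting its existing features at the start of training.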