Paper Title
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Paper Authors
Paper Abstract
Masked visual modeling (MVM) has recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies fail to find a truly effective MVM strategy that substantially benefits downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets for MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors that lead to effective MVM training, resulting in an enhanced model, VIOLETv2. Empirically, we show that VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval.
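To make the MVM objective concrete, below is a minimal PyTorch sketch of a generic masked-patch reconstruction loss. This is not the paper's actual implementation: the encoder interface, `MVMHead`, `mvm_loss`, and the masking ratio are illustrative assumptions. Continuous targets (pixels, oriented gradients, depth, flow) are handled here with a regression loss; discrete visual tokens would instead use a cross-entropy loss over a token vocabulary.

```python
import torch
import torch.nn as nn

# A minimal sketch of a generic masked visual modeling (MVM) loss,
# assuming an encoder that maps flattened video patches (B, N, D_in)
# to hidden states (B, N, H). All names here are hypothetical.

class MVMHead(nn.Module):
    """Predicts a per-patch reconstruction target from hidden states."""
    def __init__(self, hidden_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, target_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)

def mvm_loss(encoder, head, patches, targets, mask_ratio=0.15):
    """patches: (B, N, D_in) flattened video patches.
       targets: (B, N, D_tgt) per-patch targets, e.g., raw pixels,
                HOG features, depth values, or optical flow.
       Discrete-token targets would swap the MSE below for cross-entropy."""
    B, N, _ = patches.shape
    # Randomly choose which patches to mask (True = masked).
    mask = torch.rand(B, N, device=patches.device) < mask_ratio
    # Zero out masked patches before encoding.
    masked_in = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    hidden = encoder(masked_in)          # (B, N, H)
    pred = head(hidden)                  # (B, N, D_tgt)
    # The loss is computed only on masked positions, so its gradient
    # flows back through the end-to-end video transformer.
    return ((pred - targets) ** 2)[mask].mean()
```

A toy usage, with pixel values as the reconstruction target (again, purely illustrative):

```python
enc = nn.Sequential(nn.Linear(768, 512), nn.GELU(), nn.Linear(512, 512))
head = MVMHead(hidden_dim=512, target_dim=768)
x = torch.randn(2, 196, 768)                      # 2 clips, 196 patches each
loss = mvm_loss(enc, head, patches=x, targets=x)  # pixel reconstruction
loss.backward()
```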