Paper Title
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Paper Authors
Paper Abstract
Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets spanning a broad range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our method obtains 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively demonstrate the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.
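To make the "learnable coordination" of the two pretraining streams concrete, below is a minimal sketch of one plausible fusion scheme: projecting the generative (masked video modeling) features and the discriminative (video-language contrastive) features into a shared space and mixing them with a single learnable gate. This is an illustrative assumption, not the paper's actual mechanism; the module name `CoordinatedFusion`, the feature dimensions, and the scalar gate are all hypothetical choices made for clarity.

```python
import torch
import torch.nn as nn


class CoordinatedFusion(nn.Module):
    """Hypothetical sketch: fuse features from a generative (masked video
    modeling) encoder and a discriminative (video-language contrastive)
    encoder via a learnable gate. The real InternVideo coordination
    mechanism may differ; see the paper for details."""

    def __init__(self, dim_gen: int, dim_dis: int, dim_out: int):
        super().__init__()
        # project each stream into a shared embedding space
        self.proj_gen = nn.Linear(dim_gen, dim_out)
        self.proj_dis = nn.Linear(dim_dis, dim_out)
        # learnable scalar gate; sigmoid keeps the mixing weight in (0, 1)
        self.alpha = nn.Parameter(torch.tensor(0.0))

    def forward(self, feat_gen: torch.Tensor, feat_dis: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.alpha)
        return a * self.proj_gen(feat_gen) + (1 - a) * self.proj_dis(feat_dis)


# usage: fuse hypothetical 768-d generative features with 1024-d
# discriminative features into a 512-d joint video representation
fusion = CoordinatedFusion(dim_gen=768, dim_dis=1024, dim_out=512)
video_feat = fusion(torch.randn(2, 768), torch.randn(2, 1024))
print(video_feat.shape)  # torch.Size([2, 512])
```

Initializing the gate at 0 starts the mixture at an even 50/50 split after the sigmoid, letting pretraining decide how much each stream should contribute to a given downstream task.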