Paper Title
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
Paper Authors
Paper Abstract
Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that this is because they capture only pixel-level knowledge rather than spatiotemporal semantics, which hinders further progress in video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e., leveraging the naturally transcribed speech to provide noisy but useful semantics over time. Our method enforces the vision model to contextualize what is happening over time so that it can re-organize the narrative transcripts, and can be seamlessly applied to large-scale, uncurated video data in the real world. Our method demonstrates strong out-of-the-box spatiotemporal representations on diverse benchmarks, e.g., +13.6% gains over VideoMAE on SSV2 via linear probing. The code is available at https://github.com/TencentARC/TVTS.
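To make the pretext task concrete, below is a minimal sketch of what transcript sorting conditioned on video representations might look like. It is an illustrative assumption, not the authors' actual implementation (see the linked repository for that): the class name `TranscriptSortingHead`, the use of a single cross-attention layer, and the per-position classification objective are all hypothetical choices made for brevity.

```python
# Sketch only: shuffled ASR script embeddings cross-attend to video tokens,
# and the model is trained to recover each script's original temporal position.
import torch
import torch.nn as nn

class TranscriptSortingHead(nn.Module):
    """Hypothetical head that predicts the original order of shuffled
    transcript clips by attending to learned video token representations."""

    def __init__(self, dim: int = 768, num_scripts: int = 4, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_scripts)  # one logit per original position

    def forward(self, script_emb: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # script_emb:   (B, N, D) embeddings of N shuffled ASR script clips
        # video_tokens: (B, T, D) spatiotemporal tokens from the video encoder
        attended, _ = self.cross_attn(script_emb, video_tokens, video_tokens)
        return self.classifier(attended)  # (B, N, num_scripts) position logits


# Toy training step: shuffle the scripts, then supervise with the permutation.
B, N, T, D = 2, 4, 196, 768
video_tokens = torch.randn(B, T, D)    # e.g., output of a video ViT backbone
script_emb = torch.randn(B, N, D)      # e.g., output of a text encoder
perm = torch.stack([torch.randperm(N) for _ in range(B)])  # ground-truth order
head = TranscriptSortingHead(D, num_scripts=N)
shuffled = script_emb[torch.arange(B)[:, None], perm]      # apply the shuffle
logits = head(shuffled, video_tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, N), perm.reshape(-1))
loss.backward()
```

The key design point the abstract emphasizes is that the supervisory signal (sorting) can only be solved if the video representation contextualizes events over time, which is why the gradient flows back into the video encoder rather than relying on descriptive captions.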