Paper Title
Contrastive Language-Action Pre-training for Temporal Localization
Paper Authors
Paper Abstract
Long-form video understanding requires designing approaches that are able to temporally localize activities or language. End-to-end training for such tasks is limited by compute device memory constraints and the lack of temporal annotations at large scale. These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations. Once the video encoder is pre-trained, it is common practice to freeze it during fine-tuning. As a result, the video encoder does not learn temporal boundaries or unseen classes, causing a domain gap with respect to the downstream tasks. Moreover, using temporally trimmed videos prevents capturing the relations between different action categories and the background context in a video clip, which results in limited generalization capacity. To address these limitations, we propose a novel post-pre-training approach that leverages language without freezing the video encoder. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips, and language in the form of captions. Our experiments show that the proposed approach improves the state-of-the-art on temporal action localization, few-shot temporal action localization, and video language grounding tasks.
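The abstract does not spell out the loss formulation, but a masked contrastive objective between clip and caption embeddings can be sketched as below. This is a minimal illustration assuming an InfoNCE-style symmetric loss in which a boolean mask selects which video clips contribute positive clip-caption pairs; the function name, the masking rule, and the temperature value are illustrative assumptions, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(video_emb, text_emb, mask, temperature=0.07):
    """Sketch of a masked, InfoNCE-style contrastive loss.

    video_emb: (N, D) clip embeddings from the (unfrozen) video encoder.
    text_emb:  (N, D) caption embeddings from the language encoder.
    mask:      (N,) boolean; True for clips paired with a caption,
               False for clips (e.g. background) excluded from the loss.
    """
    # L2-normalize so the dot product is a cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity between every clip and every caption.
    logits = video_emb @ text_emb.t() / temperature  # (N, N)

    # Matching clip/caption pairs lie on the diagonal.
    targets = torch.arange(video_emb.size(0), device=video_emb.device)

    # Symmetric cross-entropy (video-to-text and text-to-video),
    # averaged only over the masked (caption-paired) clips.
    loss_v2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2v = F.cross_entropy(logits.t(), targets, reduction="none")
    loss = 0.5 * (loss_v2t + loss_t2v)
    return (loss * mask.float()).sum() / mask.float().sum().clamp(min=1.0)
```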