Paper Title

LocVTP: Video-Text Pre-training for Temporal Localization

Paper Authors

Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang, Yuexian Zou

Paper Abstract

Video-Text Pre-training (VTP) aims to learn transferable representations for various downstream tasks from large-scale web videos. To date, almost all existing VTP methods are limited to retrieval-based downstream tasks, e.g., video retrieval, whereas their transfer potential on localization-based tasks, e.g., temporal grounding, is under-explored. In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks, and propose a novel Localization-oriented Video-Text Pre-training framework, dubbed LocVTP. Specifically, we perform fine-grained contrastive alignment as a complement to the coarse-grained one via a clip-word correspondence discovery scheme. To further enhance the temporal reasoning ability of the learned features, we propose a context projection head and a temporal-aware contrastive loss to perceive contextual relationships. Extensive experiments on four downstream tasks across six datasets demonstrate that our LocVTP achieves state-of-the-art performance on both retrieval-based and localization-based tasks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimal model designs and training strategies.
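
As a rough illustration of the two alignment granularities mentioned in the abstract, the sketch below implements a batch-level video-sentence InfoNCE loss (coarse-grained) and a clip-word variant (fine-grained) in PyTorch. All function names, tensor shapes, and the max-over-clips matching rule are illustrative assumptions; they are not LocVTP's actual correspondence discovery scheme, context projection head, or temporal-aware loss.

    import torch
    import torch.nn.functional as F


    def coarse_infonce(video_emb, text_emb, tau=0.07):
        """Video-level / sentence-level InfoNCE over a batch (coarse-grained alignment)."""
        v = F.normalize(video_emb, dim=-1)                    # (B, D)
        t = F.normalize(text_emb, dim=-1)                     # (B, D)
        logits = v @ t.t() / tau                              # (B, B); matched pairs on the diagonal
        labels = torch.arange(v.size(0), device=v.device)
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


    def fine_infonce(clip_emb, word_emb, tau=0.07):
        """Clip-level / word-level InfoNCE (fine-grained alignment).

        Each word is matched to its most similar clip; this max-over-clips rule is a
        simple stand-in for the paper's clip-word correspondence discovery scheme.
        """
        c = F.normalize(clip_emb, dim=-1)                     # (B, Nc, D)
        w = F.normalize(word_emb, dim=-1)                     # (B, Nw, D)
        sim = torch.einsum('ikd,jwd->ijwk', c, w)             # (B, B, Nw, Nc) word-to-clip similarities
        score = sim.max(dim=-1).values.mean(dim=-1) / tau     # (B, B) pooled video-sentence scores
        labels = torch.arange(score.size(0), device=score.device)
        return 0.5 * (F.cross_entropy(score, labels) + F.cross_entropy(score.t(), labels))


    if __name__ == "__main__":
        B, Nc, Nw, D = 4, 8, 12, 256                          # toy batch of 4 video-sentence pairs
        video, text = torch.randn(B, D), torch.randn(B, D)
        clips, words = torch.randn(B, Nc, D), torch.randn(B, Nw, D)
        loss = coarse_infonce(video, text) + fine_infonce(clips, words)
        print(loss.item())

In this toy setup the two losses would simply be summed during pre-training; the paper's additional temporal-aware components are omitted here for brevity.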
