Paper Title
Dense Regression Network for Video Grounding
Paper Authors
Paper Abstract
We address the problem of video grounding from natural language queries. The key challenge in this task is that one training video might contain only a few annotated starting/ending frames that can be used as positive examples for model training. Most conventional approaches directly train a binary classifier using such imbalanced data, thus achieving inferior results. The key idea of this paper is to use the distances between each frame within the ground truth and the starting (ending) frame as dense supervision to improve video grounding accuracy. Specifically, we design a novel dense regression network (DRN) to regress the distances from each frame to the starting (ending) frame of the video segment described by the query. We also propose a simple but effective IoU regression head module to explicitly consider the localization quality of the grounding results (i.e., the IoU between the predicted location and the ground truth). Experimental results show that our approach significantly outperforms state-of-the-art methods on three datasets (i.e., Charades-STA, ActivityNet-Captions, and TACoS).
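To make the two supervision signals in the abstract concrete, here is a minimal sketch (not the authors' code; all names are illustrative assumptions): every frame inside the ground-truth segment receives regression targets equal to its distances to the segment's start and end, and the IoU head would be supervised with the temporal IoU between a predicted segment and the ground truth.

```python
def dense_regression_targets(num_frames, gt_start, gt_end):
    """For each frame index, return (dist_to_start, dist_to_end) if the
    frame lies inside the ground-truth segment, else None (background).

    Unlike sparse binary labels on the boundary frames alone, every
    in-segment frame yields a training example -- the "dense" supervision.
    """
    targets = []
    for t in range(num_frames):
        if gt_start <= t <= gt_end:
            targets.append((t - gt_start, gt_end - t))
        else:
            targets.append(None)
    return targets


def temporal_iou(pred, gt):
    """Temporal IoU between two 1-D segments given as (start, end) pairs;
    a plausible target for the IoU regression head."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Example: a 10-frame video whose ground-truth segment spans frames 2..6.
targets = dense_regression_targets(10, 2, 6)
print(targets[4])                    # (2, 2): frame 4 is 2 frames from each boundary
print(temporal_iou((2, 6), (3, 7)))  # 0.6
```

The sketch uses frame indices for clarity; the actual network regresses these quantities from frame features, but the target construction follows the same idea.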