Paper Title

Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos

Authors

Yijun Song, Jingwen Wang, Lin Ma, Zhou Yu, Jun Yu

Abstract

The task of temporally grounding textual queries in videos is to localize the video segment that semantically corresponds to a given query. Most existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios. In this work we present an effective weakly-supervised model, named Multi-Level Attentional Reconstruction Network (MARN), which relies only on video-sentence pairs during training. The proposed method leverages the idea of attentional reconstruction and directly scores the candidate segments with the learnt proposal-level attention. Moreover, another branch that learns clip-level attention is exploited to refine the proposals at both the training and testing stages. We develop a novel proposal sampling mechanism to leverage intra-proposal information for learning better proposal representations, and adopt 2D convolution to exploit inter-proposal clues for learning a reliable attention map. Experiments on the Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our MARN over existing weakly-supervised methods.
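The abstract describes scoring all candidate (start, end) segments on a 2D proposal map and smoothing scores with 2D convolution so that neighbouring proposals inform each other. The following is a minimal numpy sketch of that idea, not the authors' implementation: the pooling scheme, the scoring head (a single linear kernel), and the hand-rolled 3x3 mean convolution are all illustrative assumptions.

```python
import numpy as np

def proposal_attention_map(clip_feats, kernel):
    """Hedged sketch of a proposal-level attention branch.
    clip_feats: (T, D) array of per-clip features.
    kernel:     (D,) stand-in for a learned scoring head.
    Returns the softmax attention map over valid proposals and the
    top-scoring (start, end) proposal."""
    T, D = clip_feats.shape
    # 2D proposal map: entry (i, j) pools clips i..j (valid when j >= i).
    prop_map = np.zeros((T, T, D))
    for i in range(T):
        for j in range(i, T):
            prop_map[i, j] = clip_feats[i:j + 1].mean(axis=0)
    # Raw score per proposal, then a 2D convolution over the map lets
    # neighbouring proposals share clues (here a simple 3x3 mean filter).
    raw = prop_map @ kernel  # (T, T)
    smoothed = np.zeros_like(raw)
    for i in range(T):
        for j in range(T):
            i0, i1 = max(0, i - 1), min(T, i + 2)
            j0, j1 = max(0, j - 1), min(T, j + 2)
            smoothed[i, j] = raw[i0:i1, j0:j1].mean()
    # Softmax attention restricted to valid proposals (upper triangle).
    valid = np.triu(np.ones((T, T), dtype=bool))
    scores = np.where(valid, smoothed, -np.inf)
    att = np.exp(scores - scores.max())
    att /= att.sum()
    best = np.unravel_index(att.argmax(), att.shape)
    return att, best
```

In the weakly-supervised setting described by the abstract, such an attention map would be trained indirectly, e.g. by reconstructing the query from attention-weighted proposal features, rather than from temporal annotations.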
