Paper Title
Skimming, Locating, then Perusing: A Human-Like Framework for Natural Language Video Localization
Authors
Abstract
This paper addresses the problem of natural language video localization (NLVL). Almost all existing works follow the "only look once" framework, which uses a single model to directly capture the complex cross- and self-modal relations among video-query pairs and retrieve the relevant segment. However, we argue that these methods overlook two indispensable characteristics of an ideal localization method: 1) Frame-differentiable: given the imbalance between positive and negative video frames, it is effective to highlight positive frames and weaken negative ones during localization. 2) Boundary-precise: to predict the exact segment boundary, the model should capture fine-grained differences between consecutive frames, since their variations are often smooth. To this end, inspired by how humans perceive and localize a segment, we propose a two-step human-like framework called Skimming-Locating-Perusing (SLP). SLP consists of a Skimming-and-Locating (SL) module and a Bi-directional Perusing (BP) module. The SL module first refers to the query semantics and selects the best-matched frame from the video while filtering out irrelevant frames. Then, the BP module constructs an initial segment based on this frame and dynamically updates it by exploring its adjacent frames until no adjacent frame shares the same activity semantics. Experimental results on three challenging benchmarks show that SLP outperforms state-of-the-art methods and localizes more precise segment boundaries.
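The two-step idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how frames are scored or when expansion stops, so the cosine similarity, the mean-pooled segment representation, the `threshold` parameter, and the function names `skim_and_locate` and `peruse` are all assumptions made here for illustration. Frames and the query are represented as plain feature vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def skim_and_locate(frames, query):
    """Hypothetical SL step: score every frame against the query, return the best index."""
    scores = [cosine(f, query) for f in frames]
    anchor = max(range(len(frames)), key=scores.__getitem__)
    return anchor, scores

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def peruse(frames, anchor, threshold=0.8):
    """Hypothetical BP step: grow [start, end] outward from the anchor frame.

    A neighbouring frame joins the segment while its similarity to the current
    segment mean is at least `threshold` (an assumed stopping rule).
    """
    start = end = anchor
    grew = True
    while grew:
        grew = False
        seg = mean_vector(frames[start:end + 1])
        if start > 0 and cosine(frames[start - 1], seg) >= threshold:
            start -= 1
            grew = True
        if end < len(frames) - 1 and cosine(frames[end + 1], seg) >= threshold:
            end += 1
            grew = True
    return start, end

# Toy input: frames 2..4 depict the queried activity, the rest are background.
frames = [[1.0, 0.0], [1.0, 0.1], [0.1, 1.0], [0.0, 1.0], [0.1, 1.0], [1.0, 0.0]]
query = [0.0, 1.0]

anchor, scores = skim_and_locate(frames, query)
segment = peruse(frames, anchor)
print(anchor, segment)
```

In this toy setup the anchor is the single frame closest to the query, and the segment stops growing as soon as both neighbouring frames fall below the similarity threshold. A learned model would replace the fixed cosine scoring and threshold with trained components.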