Paper Title

Position-aware Location Regression Network for Temporal Video Grounding

Paper Authors

Sunoh Kim, Kimin Yun, Jin Young Choi

Paper Abstract

The key to successful grounding for video surveillance is to understand a semantic phrase corresponding to important actors and objects. Conventional methods ignore comprehensive contexts for the phrase or require heavy computation for multiple phrases. To understand comprehensive contexts with only one semantic phrase, we propose Position-aware Location Regression Network (PLRN) which exploits position-aware features of a query and a video. Specifically, PLRN first encodes both the video and query using positional information of words and video segments. Then, a semantic phrase feature is extracted from an encoded query with attention. The semantic phrase feature and encoded video are merged and made into a context-aware feature by reflecting local and global contexts. Finally, PLRN predicts start, end, center, and width values of a grounding boundary. Our experiments show that PLRN achieves competitive performance over existing methods with less computation time and memory.
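To make the abstract's pipeline concrete, below is a minimal PyTorch sketch of the described flow: position-aware encoding of video segments and query words, attention pooling into a single semantic phrase feature, fusion with the video for a context-aware feature, and regression of start, end, center, and width. All module names, dimensions, and design details (GRU encoders, single attention head, mean pooling) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a PLRN-style forward pass, assuming hypothetical
# module choices (GRU encoders, learned positional embeddings, attention
# pooling). This is not the authors' released code.
import torch
import torch.nn as nn


class PLRNSketch(nn.Module):
    def __init__(self, dim=256, max_len=512):
        super().__init__()
        # Learnable positional embeddings for video segments and query words.
        self.pos_emb = nn.Embedding(max_len, dim)
        self.video_enc = nn.GRU(dim, dim, batch_first=True)
        self.query_enc = nn.GRU(dim, dim, batch_first=True)
        # Attention weights that pool word features into one semantic phrase feature.
        self.phrase_attn = nn.Linear(dim, 1)
        # Fuse the phrase feature with each video segment (local context),
        # then run a GRU over segments for global context.
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.global_ctx = nn.GRU(dim, dim, batch_first=True)
        # Regression head: predicts (start, end, center, width) in [0, 1].
        self.regress = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, video_feats, query_feats):
        # video_feats: (B, T, dim) segment features; query_feats: (B, L, dim) word features
        B, T, _ = video_feats.shape
        L = query_feats.size(1)
        v_pos = self.pos_emb(torch.arange(T, device=video_feats.device))
        q_pos = self.pos_emb(torch.arange(L, device=query_feats.device))
        v, _ = self.video_enc(video_feats + v_pos)        # position-aware video encoding
        q, _ = self.query_enc(query_feats + q_pos)        # position-aware query encoding
        # One semantic phrase feature via attention pooling over words.
        attn = torch.softmax(self.phrase_attn(q), dim=1)   # (B, L, 1)
        phrase = (attn * q).sum(dim=1, keepdim=True)       # (B, 1, dim)
        # Merge the phrase with every segment, then add global context.
        fused = self.fuse(torch.cat([v, phrase.expand(-1, T, -1)], dim=-1))
        ctx, _ = self.global_ctx(fused)
        # Regress the grounding boundary from the pooled context-aware feature.
        pooled = ctx.mean(dim=1)
        start, end, center, width = self.regress(pooled).unbind(dim=-1)
        return start, end, center, width


if __name__ == "__main__":
    model = PLRNSketch()
    video = torch.randn(2, 64, 256)   # 2 clips, 64 segments each
    query = torch.randn(2, 12, 256)   # 2 queries, 12 words each
    print([t.shape for t in model(video, query)])
```

Because only one phrase feature is pooled from the query, the fusion step runs once per video segment rather than once per phrase, which is consistent with the abstract's claim of lower computation time and memory than multi-phrase methods.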
