Paper Title

Visual Relation Grounding in Videos

Paper Authors

Junbin Xiao, Xindi Shang, Xun Yang, Sheng Tang, Tat-Seng Chua

Paper Abstract

In this paper, we explore a novel task named visual Relation Grounding in Videos (vRGV). The task aims at spatio-temporally localizing the given relations, in the form of subject-predicate-object, in videos, so as to provide supportive visual facts for other high-level video-language tasks (e.g., video-language grounding and video question answering). The challenges in this task include, but are not limited to: (1) both the subject and object are required to be spatio-temporally localized to ground a query relation; (2) the temporal dynamic nature of visual relations in videos is difficult to capture; and (3) the grounding should be achieved without any direct supervision in space and time. To ground the relations, we tackle the challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical spatio-temporal region graph through relation attending and reconstruction, in which we further propose a message passing mechanism by spatial attention shifting between visual entities. Experimental results demonstrate that our model can not only outperform baseline approaches significantly, but also produce visually meaningful facts to support visual grounding. (Code is available at https://github.com/doc-doc/vRGV.)
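To make the attend-and-reconstruct idea in the abstract concrete, below is a minimal, illustrative sketch rather than the authors' implementation (their code is at https://github.com/doc-doc/vRGV). All module names, dimensions, and the exact form of the attention shifting are assumptions for illustration only: a relation-query embedding attends over per-frame region proposals to select subject regions, the object attention is conditioned on the attended subject feature (a stand-in for the paper's spatial attention shifting), and a reconstruction loss on the query provides the only, indirect supervision.

```python
# Hypothetical sketch of weakly supervised attend-and-reconstruct grounding.
# Not the authors' code; dimensions and module names are assumptions.
import torch
import torch.nn as nn


class AttendReconstruct(nn.Module):
    def __init__(self, region_dim=1024, query_dim=300, hidden=512):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, hidden)
        self.proj_query = nn.Linear(query_dim, hidden)
        # Separate attention heads for subject and object regions.
        self.att_subj = nn.Linear(hidden, 1)
        self.att_obj = nn.Linear(hidden, 1)
        # Object attention is conditioned on the attended subject feature
        # (a simplified stand-in for spatial attention shifting).
        self.shift = nn.Linear(hidden * 2, hidden)
        # Reconstruct the relation query from the attended visual features.
        self.reconstruct = nn.Linear(hidden * 2, query_dim)

    def forward(self, regions, query):
        # regions: (T, N, region_dim) features for T frames, N proposals each
        # query:   (query_dim,) embedding of a subject-predicate-object query
        r = self.proj_region(regions)                      # (T, N, H)
        q = self.proj_query(query)                         # (H,), broadcast below
        fused = torch.tanh(r + q)                          # (T, N, H)
        a_s = torch.softmax(self.att_subj(fused), dim=1)   # (T, N, 1)
        subj = (a_s * r).sum(dim=1)                        # (T, H) attended subject
        ctx = subj[:, None].expand_as(r)                   # (T, N, H)
        shifted = torch.tanh(self.shift(torch.cat([r, ctx], dim=-1)))
        a_o = torch.softmax(self.att_obj(shifted), dim=1)  # (T, N, 1)
        obj = (a_o * r).sum(dim=1)                         # (T, H) attended object
        recon = self.reconstruct(torch.cat([subj, obj], dim=-1)).mean(dim=0)
        return recon, a_s.squeeze(-1), a_o.squeeze(-1)


model = AttendReconstruct()
regions = torch.randn(8, 20, 1024)   # 8 frames, 20 region proposals per frame
query = torch.randn(300)             # e.g. an embedding of "person-ride-bicycle"
recon, att_subj, att_obj = model(regions, query)
loss = nn.functional.mse_loss(recon, query)  # reconstruction as indirect supervision
# At inference, the highest-attention proposals per frame would form the
# grounded subject and object trajectories.
```

The point of the sketch is only the supervision structure: no spatio-temporal box labels are used anywhere, and the grounding emerges from which regions the model must attend to in order to reconstruct the query relation.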
