Actor and Action Modular Network for Text-based Video Segmentation

Paper title

Actor and Action Modular Network for Text-based Video Segmentation

Authors

Yang, Jianhua; Huang, Yan; Niu, Kai; Huang, Linjiang; Ma, Zhanyu; Wang, Liang

Abstract

Text-based video segmentation aims to segment an actor in a video sequence, where the actor and the action it performs are specified by a textual query. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner according to the actor and its action, because of the problem of semantic asymmetry: the two modalities contain different amounts of semantic information during multi-modal fusion. To alleviate this problem, we propose a novel actor and action modular network that localizes the actor and its action in two separate modules. Specifically, we first learn the actor-related and action-related content from the video and the textual query, and then match them in a symmetrical manner to localize the target tube. The target tube, which contains the desired actor and action, is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also associates objects across multiple frames with the proposed temporal proposal aggregation mechanism. This enables our method to segment the video effectively and keep the predictions temporally consistent. The whole model allows joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance for both single-frame segmentation and full video segmentation on the A2D Sentences and J-HMDB Sentences datasets.
