Paper Title
Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization
Paper Authors
Paper Abstract
Query-based moment localization is a new task that localizes the best-matched segment in an untrimmed video according to a given sentence query. In this localization task, it is crucial to thoroughly mine both visual and linguistic information. To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph. Specifically, the joint graph consists of a Cross-Modal interaction Graph (CMG) and a Self-Modal relation Graph (SMG), where frames and words are represented as nodes, and the relations between cross- and self-modal node pairs are described by an attention mechanism. Through parametric message passing, the CMG highlights relevant instances across the video and the sentence, and the SMG then models the pairwise relations inside each modality for frame (word) correlation. With multiple layers of such a joint graph, our CSMGAN is able to effectively capture high-order interactions between the two modalities, thus enabling more precise localization. Besides, to better comprehend the contextual details in the query, we develop a hierarchical sentence encoder to enhance query understanding. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed model, and CSMGAN significantly outperforms the state-of-the-art methods.
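To make the abstract's core idea concrete, below is a minimal sketch (not the authors' code) of one cross-modal message-passing step as described above: frame and word features are treated as graph nodes, pairwise cross-modal relations are scored with an attention mechanism, and each frame node aggregates messages from all word nodes. All names (`CrossModalAttention`, `frame_feats`, `word_feats`, the hidden dimension) are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of attention-based cross-modal message passing (CMG-style step).
# Assumed names and shapes; the paper's actual formulation may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # projects frame nodes
        self.key = nn.Linear(dim, dim)     # projects word nodes
        self.value = nn.Linear(dim, dim)   # messages carried by word nodes
        self.scale = dim ** 0.5

    def forward(self, frame_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, dim) frame nodes; word_feats: (N, dim) word nodes
        q = self.query(frame_feats)                       # (T, dim)
        k = self.key(word_feats)                          # (N, dim)
        v = self.value(word_feats)                        # (N, dim)
        attn = F.softmax(q @ k.t() / self.scale, dim=-1)  # (T, N) cross-modal relations
        messages = attn @ v                               # aggregate word messages per frame
        return frame_feats + messages                     # residual update of frame nodes

# Toy usage: 16 frames, 8 query words, 256-d features.
frames, words = torch.randn(16, 256), torch.randn(8, 256)
updated_frames = CrossModalAttention(256)(frames, words)
print(updated_frames.shape)  # torch.Size([16, 256])
```

In the full model, an analogous self-modal step (SMG) would relate frame nodes to other frame nodes (and word nodes to other word nodes), and stacking several such joint layers is what the abstract refers to as capturing high-order cross-modal interactions.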