Paper Title
Language Guided Networks for Cross-modal Moment Retrieval
Paper Authors
Abstract
We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment in an untrimmed video described by a natural language query. It poses great challenges in properly aligning semantics between the visual and linguistic domains. Existing methods extract the features of videos and sentences independently and use the sentence embedding only in the multi-modal fusion stage, which does not make full use of the potential of language. In this paper, we present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval. In the first stage, feature extraction, we propose to jointly learn visual and language features so as to capture visual information rich enough to cover the complex semantics of the sentence query. Specifically, an early modulation unit is designed to modulate the visual feature extractor's feature maps with a linguistic embedding. We then adopt a multi-modal fusion module in the second, fusion stage. Finally, to obtain precise localization, the sentence information is used to guide the prediction of temporal positions. Specifically, a late guidance module is developed to linearly transform the output of the localization network via a channel attention mechanism. Experimental results on two popular datasets demonstrate the superior performance of our proposed method on moment retrieval (improving by 5.8\% in terms of R@1, IoU=0.5 on Charades-STA and by 5.2\% on TACoS). The source code for the complete system will be made publicly available.
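To make the two language-guidance ideas in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation: it assumes the early modulation unit applies a FiLM-style channel-wise scale and shift predicted from the sentence embedding, and that the late guidance module re-weights the localizer's output channels with a sigmoid channel-attention gate. All module names, tensor shapes, and dimensions are illustrative assumptions.

```python
# Hypothetical sketch of LGN-style language guidance; shapes and names are assumed.
import torch
import torch.nn as nn


class EarlyModulationUnit(nn.Module):
    """Scale and shift visual feature maps with parameters predicted from the query."""

    def __init__(self, lang_dim: int, visual_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(lang_dim, visual_channels)  # channel-wise scale
        self.to_beta = nn.Linear(lang_dim, visual_channels)   # channel-wise shift

    def forward(self, visual_feats: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, C, T) clip-level features; sent_emb: (B, lang_dim)
        gamma = self.to_gamma(sent_emb).unsqueeze(-1)  # (B, C, 1)
        beta = self.to_beta(sent_emb).unsqueeze(-1)    # (B, C, 1)
        return gamma * visual_feats + beta


class LateGuidanceModule(nn.Module):
    """Channel attention over the localization network's output features."""

    def __init__(self, lang_dim: int, loc_channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(lang_dim, loc_channels), nn.Sigmoid())

    def forward(self, loc_feats: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # loc_feats: (B, C, T) features feeding the temporal-boundary predictor
        attn = self.gate(sent_emb).unsqueeze(-1)  # (B, C, 1) weights in [0, 1]
        return attn * loc_feats                   # linearly transform each channel


if __name__ == "__main__":
    B, C, T, L = 2, 512, 64, 300
    visual, sentence = torch.randn(B, C, T), torch.randn(B, L)
    modulated = EarlyModulationUnit(L, C)(visual, sentence)
    guided = LateGuidanceModule(L, C)(modulated, sentence)
    print(modulated.shape, guided.shape)  # both (2, 512, 64)
```

In this reading, language conditions the pipeline at both ends: the early unit reshapes visual features before fusion, while the late gate emphasizes query-relevant channels just before temporal positions are predicted.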