Paper Title

Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding

Paper Authors

Ziyue Wu, Junyu Gao, Shucheng Huang, Changsheng Xu

Paper Abstract

Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields. In this paper, we deal with the fast video temporal grounding (FVTG) task, aiming at localizing the target segment with high speed and favorable accuracy. Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance, which suffer from a test-time bottleneck. Although several common space-based methods enjoy the high-speed merit during inference, they can hardly capture the comprehensive and explicit relations between visual and textual modalities. In this paper, to tackle the dilemma of the speed-accuracy tradeoff, we propose a commonsense-aware cross-modal alignment (CCA) framework, which incorporates commonsense-guided visual and text representations into a complementary common space for fast video temporal grounding. Specifically, the commonsense concepts are explored and exploited by extracting the structural semantic information from a language corpus. Then, a commonsense-aware interaction module is designed to obtain bridged visual and text features by utilizing the learned commonsense concepts. Finally, to maintain the original semantic information of textual queries, a cross-modal complementary common space is optimized to obtain matching scores for performing FVTG. Extensive results on two challenging benchmarks show that our CCA method performs favorably against state-of-the-art methods while running at high speed. Our code is available at https://github.com/ZiyueWu59/CCA.
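The abstract contrasts cross-modal interaction modules, which are accurate but slow at test time, with common-space methods, which are fast because matching reduces to similarity lookups in a shared embedding space. The sketch below illustrates only that common-space idea, not the authors' CCA implementation: it omits the commonsense-aware interaction module and the complementary-space training, and all class names, feature dimensions, and shapes are illustrative assumptions.

```python
# Minimal sketch (not the authors' CCA method) of common-space matching for
# video temporal grounding: candidate moment features are embedded once per
# video, and each query is answered with a single matrix-vector product.
# All names and dimensions below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceMatcher(nn.Module):
    def __init__(self, visual_dim=1024, text_dim=768, common_dim=256):
        super().__init__()
        # Project each modality into a shared (common) embedding space.
        self.visual_proj = nn.Linear(visual_dim, common_dim)
        self.text_proj = nn.Linear(text_dim, common_dim)

    def embed_moments(self, moment_feats):
        # (num_moments, visual_dim) -> (num_moments, common_dim)
        # Can be precomputed and cached offline, before any query arrives.
        return F.normalize(self.visual_proj(moment_feats), dim=-1)

    def embed_query(self, query_feat):
        # (text_dim,) -> (common_dim,)
        return F.normalize(self.text_proj(query_feat), dim=-1)

    def forward(self, moment_feats, query_feat):
        moments = self.embed_moments(moment_feats)
        query = self.embed_query(query_feat)
        # Matching scores are plain dot products, so test-time cost is one
        # matrix-vector multiply rather than a full cross-modal module.
        return moments @ query  # (num_moments,)

# Usage: pick the candidate moment with the highest matching score.
matcher = CommonSpaceMatcher()
scores = matcher(torch.randn(50, 1024), torch.randn(768))
best_moment = scores.argmax().item()
```

Because the moment embeddings can be indexed ahead of time, inference speed is dominated by the similarity computation, which is the efficiency advantage the abstract attributes to common space-based methods.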
