Hierarchical Local-Global Transformer for Temporal Sentence Grounding

Paper Title

Hierarchical Local-Global Transformer for Temporal Sentence Grounding

Authors

Fang, Xiang; Liu, Daizong; Zhou, Pan; Xu, Zichuan; Li, Ruixuan

Abstract

This paper studies the multimedia problem of temporal sentence grounding (TSG), which aims to accurately localize the specific segment of an untrimmed video that matches a given sentence query. Traditional TSG methods mainly follow a top-down or bottom-up framework and are not end-to-end. They rely heavily on time-consuming post-processing to refine the grounding results. Recently, several transformer-based approaches have been proposed to efficiently and effectively model the fine-grained semantic alignment between video and query. Although these methods achieve strong performance, they treat the frames of the video and the words of the query equally as transformer input for correlation, and so fail to capture their different levels of granularity with distinct semantics. To address this issue, we propose a novel Hierarchical Local-Global Transformer (HLGT) that leverages this hierarchy information and models the interactions between different levels of granularity and between different modalities to learn more fine-grained multi-modal representations. Specifically, we first split the video and query into individual clips and phrases to learn their local context (adjacent dependency) and global correlation (long-range dependency) via a temporal transformer. Then, a global-local transformer is introduced to learn the interactions between the local-level and global-level semantics for better multi-modal reasoning. In addition, we develop a new cross-modal cycle-consistency loss to enforce interaction between the two modalities and encourage semantic alignment between them. Finally, we design a new cross-modal parallel transformer decoder to integrate the encoded visual and textual features for final grounding. Extensive experiments on three challenging datasets show that the proposed HLGT achieves new state-of-the-art performance.
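The abstract does not define the cross-modal cycle-consistency loss, so the following is only a minimal sketch of one common way such a loss can be set up, not the formulation used in HLGT. It assumes that each phrase feature softly attends over the clip features, that the pooled clip feature then attends back over the phrases, and that the loss penalises the cycle for not returning to the starting phrase. The names `attend` and `cycle_consistency_loss`, the dot-product scoring, and the toy feature vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys):
    """Return attention weights over keys and the weighted average of the keys."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(keys[0])
    pooled = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
    return weights, pooled


def cycle_consistency_loss(phrases, clips):
    """Hypothetical loss: phrase i -> soft clip -> phrases, penalising a return to j != i."""
    loss = 0.0
    for i, phrase in enumerate(phrases):
        _, soft_clip = attend(phrase, clips)          # text -> video
        back_weights, _ = attend(soft_clip, phrases)  # video -> text
        loss += -math.log(back_weights[i])            # cross-entropy on the start index
    return loss / len(phrases)


# Toy features (assumed): two phrases and two clips in a 2-D embedding space.
phrases = [[4.0, 0.0], [0.0, 4.0]]
aligned_clips = [[4.0, 0.0], [0.0, 4.0]]
uninformative_clips = [[1.0, 1.0], [1.0, 1.0]]

aligned_loss = cycle_consistency_loss(phrases, aligned_clips)
flat_loss = cycle_consistency_loss(phrases, uninformative_clips)
```

In this toy setting the loss is close to zero when each phrase has a clearly matching clip, and it rises when the clips carry no information that distinguishes the phrases, which is the behaviour a cycle-consistency objective is meant to encourage.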
