Paper Title
Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos
Paper Authors
Paper Abstract
Language-driven action localization in videos is a challenging task that involves not only visual-linguistic matching but also action boundary prediction. Recent progress has been achieved by aligning language queries to video segments, but estimating precise boundaries remains under-explored. In this paper, we propose entity-aware and motion-aware Transformers that progressively localize actions in videos by first coarsely locating clips with entity queries and then finely predicting exact boundaries within a shrunken temporal region with motion queries. The entity-aware Transformer incorporates textual entities into visual representation learning via cross-modal and cross-frame attention to facilitate attending to action-related video clips. The motion-aware Transformer captures fine-grained motion changes at multiple temporal scales by integrating long short-term memory into the self-attention module, further improving the precision of action boundary prediction. Extensive experiments on the Charades-STA and TACoS datasets demonstrate that our method outperforms existing methods.
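To make the motion-aware component more concrete, the following is a minimal, hypothetical sketch of a self-attention block that folds a long short-term memory branch into standard multi-head attention over clip features. The abstract only states that LSTM is integrated into self-attention at multiple temporal scales; the dimensions, layer ordering, and the additive fusion used here are assumptions for illustration, not the authors' exact design.

```python
# Hypothetical sketch of a motion-aware attention block, assuming PyTorch.
# Fusion by summation and the specific shapes are illustrative assumptions.
import torch
import torch.nn as nn


class MotionAwareAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Standard self-attention over clip features within the shrunken region.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # LSTM branch tracks frame-to-frame (fine-grained) motion changes.
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_clips, dim) visual features of candidate clips.
        attn_out, _ = self.attn(clip_feats, clip_feats, clip_feats)
        motion_out, _ = self.lstm(clip_feats)
        # Fuse global attention context with sequential motion cues (assumed sum).
        return self.norm(clip_feats + attn_out + motion_out)


# Usage: a multi-scale variant could temporally downsample clip_feats and apply
# the block at each scale before boundary prediction.
feats = torch.randn(2, 64, 256)
out = MotionAwareAttention()(feats)
print(out.shape)  # torch.Size([2, 64, 256])
```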