Paper Title

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Paper Authors

Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou

Paper Abstract

To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, using a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select frames and image regions that are closely relevant to the question itself. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. The experimental results on four VideoQA datasets, including AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior at computation efficiency and interpretability.
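The abstract's core mechanism, cascaded question-conditioned segment and region selection followed by attention, can be made concrete with a small sketch. The PyTorch code below is a rough illustration only, not the authors' implementation: the module name MISTLayerSketch, the mean-pooled segment summaries, the dot-product top-k scoring, and all dimensions are assumptions made for readability; the actual model builds on features from a pre-trained multi-modal encoder and includes details omitted here.

```python
import torch
import torch.nn as nn

class MISTLayerSketch(nn.Module):
    """One iteration of question-conditioned segment/region selection
    followed by attention over multi-granularity tokens (a hypothetical
    simplification of the mechanism described in the abstract)."""

    def __init__(self, dim, top_segments=2, top_regions=4, heads=4):
        super().__init__()
        self.top_segments = top_segments
        self.top_regions = top_regions
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q, video):
        # q: (B, D) question embedding; video: (B, T, R, D)
        # with T segments, each holding R region tokens.
        B, T, R, D = video.shape

        # Segment selection: score each segment summary against the
        # question and keep only the top-k segments (sparse temporal
        # selection instead of dense spatial-temporal self-attention).
        seg_feat = video.mean(dim=2)                                 # (B, T, D)
        seg_scores = torch.einsum('bd,btd->bt', q, seg_feat)
        top_seg = seg_scores.topk(self.top_segments, dim=1).indices  # (B, k)
        idx = top_seg[..., None, None].expand(-1, -1, R, D)
        regions = video.gather(1, idx).flatten(1, 2)                 # (B, k*R, D)

        # Region selection: within the chosen segments, keep the
        # regions most relevant to the question (spatial selection).
        reg_scores = torch.einsum('bd,bnd->bn', q, regions)
        top_reg = reg_scores.topk(self.top_regions, dim=1).indices
        regions = regions.gather(1, top_reg[..., None].expand(-1, -1, D))

        # Attend from the question to coarse (segment) and fine (region)
        # tokens together, i.e. visual concepts at different granularities.
        tokens = torch.cat([seg_feat, regions], dim=1)
        out, _ = self.attn(q.unsqueeze(1), tokens, tokens)
        return out.squeeze(1)                                        # (B, D)

# Iterating selection + attention over layers is what supports reasoning
# over multiple events: each layer can re-select different segments.
layers = nn.ModuleList([MISTLayerSketch(256) for _ in range(2)])
q = torch.randn(8, 256)                # question embeddings
video = torch.randn(8, 16, 12, 256)    # 16 segments x 12 region tokens
for layer in layers:
    q = layer(q, video)                # refined feature for answer prediction
```

Because each layer attends only to a handful of selected segments and regions rather than all frame patches, the cost stays modest as video length grows, which matches the computational-efficiency claim in the abstract.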
