Paper Title

Transforming Multi-Concept Attention into Video Summarization

Authors

Yen-Ting Liu, Yu-Jhe Li, Yu-Chiang Frank Wang

Abstract

Video summarization is among the challenging tasks in computer vision, aiming to identify highlight frames or shots in a lengthy input video. In this paper, we propose a novel attention-based framework for video summarization with complex video data. Unlike previous works that only apply an attention mechanism to the correspondence between frames, our multi-concept video self-attention (MC-VSA) model is designed to identify informative regions across temporal and concept video features, jointly exploiting context diversity over time and space for summarization purposes. Together with the consistency between video and summary enforced in our framework, our model can be applied to both labeled and unlabeled data, making our method preferable for real-world applications. Extensive experiments on two benchmarks demonstrate the effectiveness of our model both quantitatively and qualitatively, and confirm its superiority over state-of-the-art approaches.
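
The abstract only outlines the architecture, but its core idea (several self-attention heads over the frame sequence, each meant to capture a different visual concept, fused into per-frame importance scores) can be illustrated concretely. Below is a minimal PyTorch sketch of that idea; it is not the authors' implementation, and all module names, layer sizes, and the consistency objective shown are hypothetical stand-ins for the components the abstract describes.

```python
import torch
import torch.nn as nn

class MultiConceptSelfAttention(nn.Module):
    """Hypothetical sketch: one self-attention head per assumed visual concept."""
    def __init__(self, feat_dim=1024, num_concepts=4):
        super().__init__()
        # One temporal self-attention head per assumed "concept".
        self.heads = nn.ModuleList(
            nn.MultiheadAttention(feat_dim, num_heads=1, batch_first=True)
            for _ in range(num_concepts)
        )
        # Fuse per-concept outputs and predict a per-frame importance score.
        self.fuse = nn.Linear(feat_dim * num_concepts, feat_dim)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frames):
        # frames: (batch, time, feat_dim) per-frame features
        concept_outs = []
        for attn in self.heads:
            out, _ = attn(frames, frames, frames)  # self-attention over time
            concept_outs.append(out)
        fused = torch.relu(self.fuse(torch.cat(concept_outs, dim=-1)))
        return torch.sigmoid(self.score(fused)).squeeze(-1)  # (batch, time)

def consistency_loss(frames, scores):
    # Hypothetical stand-in for the video-summary consistency idea: the
    # score-weighted summary representation should stay close to the pooled
    # representation of the whole video, requiring no frame-level labels.
    summary = (frames * scores.unsqueeze(-1)).sum(1) / scores.sum(1, keepdim=True)
    return nn.functional.mse_loss(summary, frames.mean(dim=1))

# Example: score 120 frames of 1024-d features for one unlabeled video.
model = MultiConceptSelfAttention()
video = torch.randn(1, 120, 1024)
scores = model(video)
print(scores.shape, consistency_loss(video, scores).item())
```

The paper's actual attention design and consistency objective are more elaborate; the sketch only conveys why multiple attention heads can diversify what the summarizer attends to, and why a video-summary consistency term lets the model learn from unlabeled videos.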
