Paper Title
Exploiting long-term temporal dynamics for video captioning
Paper Authors
Paper Abstract
Automatically describing videos with natural language is a fundamental challenge for computer vision and natural language processing. Recently, progress on this problem has been achieved through two steps: 1) employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) (e.g. VGG, ResNet or C3D) to extract spatial and/or temporal features that encode video contents; and 2) applying Recurrent Neural Networks (RNNs) to generate sentences describing the events in videos. Temporal attention-based models have made much progress by considering the importance of each video frame. However, for a long video, especially one that consists of a set of sub-events, we should discover and leverage the importance of each sub-shot rather than each frame. In this paper, we propose a novel approach, namely temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences. In TS-LSTM, a temporal pooling LSTM (TP-LSTM) is designed to incorporate both spatial and temporal information to extract long-term temporal dynamics within video sub-shots, and a stacked LSTM is introduced to generate a sequence of words to describe the video. Experimental results obtained on two public video captioning benchmarks indicate that our TS-LSTM outperforms the state-of-the-art methods.
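To make the encoder-decoder structure described in the abstract concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' implementation: frame features are grouped into sub-shots, an LSTM pools each sub-shot into a long-term temporal representation, and a stacked LSTM decodes a word sequence. All layer sizes, the fixed-length sub-shot segmentation, and the mean-pooling choices are assumptions made for demonstration only.

```python
# Illustrative sketch (not the paper's code): a minimal TS-LSTM-style
# encoder-decoder. Feature dimensions and sub-shot handling are assumed.
import torch
import torch.nn as nn


class TPLSTM(nn.Module):
    """Temporal-pooling LSTM: runs an LSTM over the frames of each
    sub-shot and mean-pools its hidden states into one sub-shot vector."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats, shot_len):
        # frame_feats: (batch, num_frames, feat_dim); num_frames divisible by shot_len
        b, t, d = frame_feats.shape
        shots = frame_feats.view(b * (t // shot_len), shot_len, d)
        hidden, _ = self.lstm(shots)                      # (b * num_shots, shot_len, h)
        pooled = hidden.mean(dim=1)                       # temporal pooling per sub-shot
        return pooled.view(b, t // shot_len, -1)          # (batch, num_shots, hidden_dim)


class CaptionDecoder(nn.Module):
    """Stacked (two-layer) LSTM that decodes a word sequence conditioned
    on the sub-shot representations produced by TPLSTM."""

    def __init__(self, hidden_dim, vocab_size, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + hidden_dim, hidden_dim,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, shot_feats, captions):
        # shot_feats: (batch, num_shots, hidden_dim); captions: (batch, seq_len)
        video_vec = shot_feats.mean(dim=1, keepdim=True)  # global video summary
        words = self.embed(captions)                      # (batch, seq_len, embed_dim)
        ctx = video_vec.expand(-1, words.size(1), -1)     # repeat video context per step
        hidden, _ = self.lstm(torch.cat([words, ctx], dim=-1))
        return self.out(hidden)                           # (batch, seq_len, vocab_size)


if __name__ == "__main__":
    frames = torch.randn(2, 32, 2048)          # e.g. ResNet frame features, 32 frames
    caps = torch.randint(0, 1000, (2, 12))     # toy caption token ids
    enc, dec = TPLSTM(2048, 512), CaptionDecoder(512, 1000)
    logits = dec(enc(frames, shot_len=8), caps)
    print(logits.shape)                        # torch.Size([2, 12, 1000])
```

In this sketch the decoder sees only a mean of the sub-shot vectors; the paper's point is that sub-shot-level representations replace frame-level attention, and any attention or fusion over sub-shots would slot in where the global summary is computed here.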