Paper Title
Exploiting long-term temporal dynamics for video captioning
Paper Authors
Paper Abstract
Automatically describing videos with natural language is a fundamental challenge for computer vision and natural language processing. Recently, progress on this problem has been achieved through two steps: 1) employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) (e.g. VGG, ResNet or C3D) to extract spatial and/or temporal features that encode video contents; and 2) applying Recurrent Neural Networks (RNNs) to generate sentences describing the events in videos. Temporal attention-based models have made much progress by considering the importance of each video frame. However, for a long video, especially one that consists of a set of sub-events, we should discover and leverage the importance of each sub-shot rather than each frame. In this paper, we propose a novel approach, namely temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences. In TS-LSTM, a temporal pooling LSTM (TP-LSTM) is designed to incorporate both spatial and temporal information to extract long-term temporal dynamics within video sub-shots, and a stacked LSTM is introduced to generate a sequence of words to describe the video. Experimental results obtained on two public video captioning benchmarks indicate that our TS-LSTM outperforms the state-of-the-art methods.
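To make the encoder-decoder structure described in the abstract concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' implementation: frame features are grouped into sub-shots, an LSTM pools each sub-shot into a long-term temporal representation, and a stacked LSTM decodes a word sequence. All layer sizes, the fixed-length sub-shot segmentation, and the mean-pooling choices are assumptions made for demonstration only.

```python
# Illustrative sketch (not the paper's code): a minimal TS-LSTM-style
# encoder-decoder. Feature dimensions and sub-shot handling are assumed.
import torch
import torch.nn as nn


class TPLSTM(nn.Module):
    """Temporal-pooling LSTM: runs an LSTM over the frames of each
    sub-shot and mean-pools its hidden states into one sub-shot vector."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats, shot_len):
        # frame_feats: (batch, num_frames, feat_dim); num_frames divisible by shot_len
        b, t, d = frame_feats.shape
        shots = frame_feats.view(b * (t // shot_len), shot_len, d)
        hidden, _ = self.lstm(shots)                      # (b * num_shots, shot_len, h)
        pooled = hidden.mean(dim=1)                       # temporal pooling per sub-shot
        return pooled.view(b, t // shot_len, -1)          # (batch, num_shots, hidden_dim)


class CaptionDecoder(nn.Module):
    """Stacked (two-layer) LSTM that decodes a word sequence conditioned
    on the sub-shot representations produced by TPLSTM."""

    def __init__(self, hidden_dim, vocab_size, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + hidden_dim, hidden_dim,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, shot_feats, captions):
        # shot_feats: (batch, num_shots, hidden_dim); captions: (batch, seq_len)
        video_vec = shot_feats.mean(dim=1, keepdim=True)  # global video summary
        words = self.embed(captions)                      # (batch, seq_len, embed_dim)
        ctx = video_vec.expand(-1, words.size(1), -1)     # repeat video context per step
        hidden, _ = self.lstm(torch.cat([words, ctx], dim=-1))
        return self.out(hidden)                           # (batch, seq_len, vocab_size)


if __name__ == "__main__":
    frames = torch.randn(2, 32, 2048)          # e.g. ResNet frame features, 32 frames
    caps = torch.randint(0, 1000, (2, 12))     # toy caption token ids
    enc, dec = TPLSTM(2048, 512), CaptionDecoder(512, 1000)
    logits = dec(enc(frames, shot_len=8), caps)
    print(logits.shape)                        # torch.Size([2, 12, 1000])
```

In this sketch the decoder sees only a mean of the sub-shot vectors; the paper's point is that sub-shot-level representations replace frame-level attention, and any attention or fusion over sub-shots would slot in where the global summary is computed here.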