Paper Title
Hierarchical Memory Decoding for Video Captioning
Paper Authors
Paper Abstract
Recent advances in video captioning often employ a recurrent neural network (RNN) as the decoder. However, RNNs are prone to diluting long-term information. Recent work has demonstrated that the memory network (MemNet) has the advantage of storing long-term information, yet as a decoder it has not been well exploited for video captioning, partly because of the difficulty of sequence decoding with MemNet. Instead of the common practice of sequence decoding with an RNN, in this paper we devise a novel memory decoder for video captioning. Concretely, after obtaining the representation of each frame through a pre-trained network, we first fuse the visual and lexical information. Then, at each time step, we construct a multi-layer MemNet-based decoder: in each layer, we employ a memory set to store previous information and an attention mechanism to select the information related to the current input. This decoder thus avoids the dilution of long-term information, and the multi-layer architecture helps capture dependencies between frames and word sequences. Experimental results show that even without an encoding network, our decoder still achieves competitive performance and outperforms the RNN decoder. Furthermore, compared with a one-layer RNN decoder, our decoder has fewer parameters.
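The abstract gives no implementation details, but a minimal PyTorch sketch of one memory-attention decoder layer, assuming standard dot-product attention and using our own hypothetical names and shapes (MemoryDecoderLayer, dim, the decoding loop), might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryDecoderLayer(nn.Module):
    """Illustrative sketch of one MemNet-style decoder layer.

    A memory set stores the representations of all previous time steps;
    an attention mechanism selects the entries relevant to the current
    input, so long-term information is read directly instead of being
    squeezed through a recurrent hidden state.
    """

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)    # projects the current input to a query
        self.key = nn.Linear(dim, dim)      # projects memory entries to keys
        self.value = nn.Linear(dim, dim)    # projects memory entries to values
        self.out = nn.Linear(2 * dim, dim)  # fuses input with the attended memory

    def forward(self, x, memory):
        # x:      (batch, dim)     fused visual+lexical input at this step
        # memory: (batch, t, dim)  representations of all previous steps
        q = self.query(x).unsqueeze(1)                        # (batch, 1, dim)
        k = self.key(memory)                                  # (batch, t, dim)
        v = self.value(memory)                                # (batch, t, dim)
        scores = torch.bmm(q, k.transpose(1, 2))              # (batch, 1, t)
        attn = F.softmax(scores / (k.size(-1) ** 0.5), dim=-1)
        read = torch.bmm(attn, v).squeeze(1)                  # (batch, dim)
        return self.out(torch.cat([x, read], dim=-1))         # (batch, dim)


# Hypothetical decoding loop: the memory set grows by one entry per step.
layer = MemoryDecoderLayer(dim=512)
memory = torch.zeros(2, 1, 512)                  # initial memory set
for step in range(5):
    x = torch.randn(2, 512)                      # stand-in fused input
    h = layer(x, memory)                         # attend over all previous steps
    memory = torch.cat([memory, h.unsqueeze(1)], dim=1)
```

Stacking several such layers would give the multi-layer architecture described above; note that each layer needs only a few dim-by-dim projections, which is consistent with the abstract's claim of fewer parameters than a one-layer RNN decoder.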