Paper Title
ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis
Paper Authors
Paper Abstract
Deep Convolutional Neural Networks (CNNs) are powerful models that have achieved excellent performance on difficult computer vision tasks. Although CNNs perform well whenever large labeled training sets are available, they perform poorly on video frame synthesis because objects deform and move, scene lighting changes, and the camera moves across the video sequence. In this paper, we present a novel and general end-to-end architecture, called Convolutional Transformer or ConvTransformer, for video frame sequence learning and video frame synthesis. The core ingredient of ConvTransformer is the proposed attention layer, i.e., the multi-head convolutional self-attention layer, which learns the sequential dependencies within a video sequence. ConvTransformer uses an encoder, built upon multi-head convolutional self-attention layers, to encode the sequential dependencies among the input frames, and a decoder then decodes the long-term dependencies between the target synthesized frames and the input frames. Experiments on the video future frame extrapolation task show that ConvTransformer is superior in quality to, and more parallelizable than, recent approaches built upon convolutional LSTM (ConvLSTM). To the best of our knowledge, this is the first time that a ConvTransformer architecture has been proposed and applied to video frame synthesis.
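To make the abstract's central idea concrete, the sketch below shows one plausible reading of a multi-head convolutional self-attention layer: the linear query/key/value projections of the standard Transformer are replaced by 2-D convolutions so that spatial structure is preserved, and attention weights are computed across the temporal axis of a frame-feature sequence. The class name `MultiHeadConvSelfAttention`, the 3x3 kernels, the spatially summed dot-product similarity, and all shapes are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadConvSelfAttention(nn.Module):
    """Self-attention over the temporal axis of a frame-feature sequence,
    with convolutional (rather than linear) query/key/value projections."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0, "channels must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = channels // num_heads
        # 3x3 convolutions replace the linear Q/K/V projections of the
        # standard Transformer so that spatial structure is preserved.
        self.to_q = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.to_out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        frames = x.reshape(b * t, c, h, w)
        q = self.to_q(frames).reshape(b, t, self.num_heads, self.head_dim, h, w)
        k = self.to_k(frames).reshape(b, t, self.num_heads, self.head_dim, h, w)
        v = self.to_v(frames).reshape(b, t, self.num_heads, self.head_dim, h, w)
        # One scalar score per (head, query frame i, key frame j), computed as a
        # spatially summed dot product of the projected feature maps.
        scores = torch.einsum('bindhw,bjndhw->bnij', q, k)
        scores = scores / (self.head_dim * h * w) ** 0.5
        attn = F.softmax(scores, dim=-1)  # attention over key frames j
        # Each output frame is a weighted combination of the value feature maps.
        out = torch.einsum('bnij,bjndhw->bindhw', attn, v)
        out = self.to_out(out.reshape(b * t, c, h, w))
        return out.reshape(b, t, c, h, w)


# Toy usage: a batch of 2 sequences, each with 5 frames of 64-channel features.
layer = MultiHeadConvSelfAttention(channels=64, num_heads=4)
seq = torch.randn(2, 5, 64, 32, 32)
out = layer(seq)  # same shape as seq; every frame attends to all frames
```

Under this reading, each frame attends to every other frame in the sequence in a single layer, which is what allows the encoder and decoder to be evaluated over all time steps in parallel, in contrast to the step-by-step recurrence of ConvLSTM-based approaches.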