Paper Title
End-to-End Learning for Video Frame Compression with Self-Attention
Paper Authors
Paper Abstract
One of the core components of conventional (i.e., non-learned) video codecs is the prediction of a frame from a previously decoded frame, by leveraging temporal correlations. In this paper, we propose an end-to-end learned system for compressing video frames. Instead of relying on pixel-space motion (as with optical flow), our system learns deep embeddings of frames and encodes their difference in latent space. On the decoder side, an attention mechanism attends to the latent spaces of the frames to decide how different parts of the previous and current frame are combined to form the final prediction of the current frame. Spatially varying channel allocation is achieved by using importance masks acting on the feature channels. The model is trained to reduce the bitrate by minimizing a loss on the importance maps and a loss on the probabilities output by a context model for arithmetic coding. In our experiments, we show that the proposed system achieves high compression rates and high objective visual quality, as measured by MS-SSIM and PSNR. Furthermore, we provide ablation studies in which we highlight the contribution of the different components.
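
The abstract names four interacting pieces: latent-space residual coding, decoder-side attention fusion, importance masks for channel allocation, and a rate loss driven by a context model. The PyTorch sketch below illustrates one way these pieces could fit together; it is a minimal sketch under our own assumptions, not the authors' implementation. All module names, layer sizes, and the factorized discretized-Gaussian rate model (standing in for the paper's context model for arithmetic coding) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEmbedding(nn.Module):
    """Strided-conv encoder: maps a frame to a deep latent embedding."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(ch, ch, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(ch, ch, 5, stride=2, padding=2))
    def forward(self, x):
        return self.net(x)

class FrameDecoder(nn.Module):
    """Mirror of the encoder: maps a latent back to pixel space."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1))
    def forward(self, z):
        return self.net(z)

class AttentionFusion(nn.Module):
    """Decoder-side attention: per-location soft weights decide how the
    previous-frame latent and the current-frame latent are combined."""
    def __init__(self, ch=64):
        super().__init__()
        self.score = nn.Conv2d(2 * ch, 2, kernel_size=1)
    def forward(self, z_prev, z_curr):
        w = torch.softmax(self.score(torch.cat([z_prev, z_curr], 1)), dim=1)
        return w[:, :1] * z_prev + w[:, 1:] * z_curr

class ImportanceMask(nn.Module):
    """Spatially-varying channel allocation: a per-location importance
    value in [0, 1] gates how many latent channels survive."""
    def __init__(self, ch=64):
        super().__init__()
        self.ch = ch
        self.head = nn.Conv2d(ch, 1, kernel_size=1)
    def forward(self, z):
        m = torch.sigmoid(self.head(z))                       # (B,1,H,W)
        k = torch.arange(self.ch, device=z.device).view(1, -1, 1, 1)
        return z * torch.clamp(m * self.ch - k, 0.0, 1.0), m  # (B,C,H,W)

class FactorizedRate(nn.Module):
    """Hypothetical stand-in for the paper's context model: a factorized
    discretized Gaussian with learned per-channel scales; the estimated
    coding cost is -log2 p(r_hat), summed over all latent elements."""
    def __init__(self, ch=64):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(1, ch, 1, 1))
    def forward(self, r_hat):
        s = self.log_scale.exp() * 2 ** 0.5
        # probability mass of the unit-width bin around each integer
        p = 0.5 * (torch.erf((r_hat + 0.5) / s) - torch.erf((r_hat - 0.5) / s))
        return -torch.log2(p.clamp_min(1e-9)).sum()

def training_step(enc, dec, fuse, mask, rate, prev_recon, curr,
                  lam_rate=0.01, lam_mask=0.01):
    z_prev, z_curr = enc(prev_recon), enc(curr)
    r, m = mask(z_curr - z_prev)               # gated latent-space residual
    r_hat = r + (torch.round(r) - r).detach()  # straight-through quantization
    bits = rate(r_hat)                         # estimated coding cost
    x_rec = dec(fuse(z_prev, z_prev + r_hat))  # attention combines the latents
    dist = F.mse_loss(x_rec, curr)             # the paper optimizes MS-SSIM
    return dist + lam_rate * bits / curr.numel() + lam_mask * m.mean()

# Usage on random data (64x64 frames, divisible by the 8x downsampling):
enc, dec = FrameEmbedding(), FrameDecoder()
fuse, mask, rate = AttentionFusion(), ImportanceMask(), FactorizedRate()
prev_recon, curr = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
training_step(enc, dec, fuse, mask, rate, prev_recon, curr).backward()
```

Note the role of the two rate terms: the mean of the importance map penalizes allocating channels where they are not needed, while the negative log-likelihood term approximates the bitstream length an arithmetic coder would produce; in the paper the probabilities come from a learned context model rather than the factorized Gaussian assumed here.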