Paper Title


Neural Video Coding using Multiscale Motion Compensation and Spatiotemporal Context Model

Paper Authors

Haojie Liu, Ming Lu, Zhan Ma, Fan Wang, Zhihuang Xie, Xun Cao, Yao Wang

Paper Abstract


Over the past two decades, traditional block-based video coding has made remarkable progress and spawned a series of well-known standards such as MPEG-4, H.264/AVC and H.265/HEVC. Meanwhile, deep neural networks (DNNs) have shown powerful capacity for visual content understanding, feature extraction and compact representation. Some previous works have explored learned video coding algorithms in an end-to-end manner, showing great potential compared with traditional methods. In this paper, we propose an end-to-end deep neural video coding framework (NVC), which uses variational autoencoders (VAEs) with joint spatial and temporal prior aggregation (PA) to exploit the correlations in intra-frame pixels, inter-frame motions and inter-frame compensation residuals, respectively. Novel features of NVC include: 1) to estimate and compensate motion over a large range of magnitudes, we propose an unsupervised multiscale motion compensation network (MS-MCN) together with a pyramid decoder in the VAE for coding motion features, which generates multiscale flow fields; 2) we design a novel adaptive spatiotemporal context model for efficient entropy coding of motion information; 3) we adopt nonlocal attention modules (NLAM) at the bottlenecks of the VAEs for implicit adaptive feature extraction and activation, leveraging their high transformation capacity and unequal weighting of joint global and local information; and 4) we introduce multi-module optimization and a multi-frame training strategy to minimize temporal error propagation among P-frames. NVC is evaluated in the low-delay causal setting and compared with H.265/HEVC, H.264/AVC and other learned video compression methods following the common test conditions, demonstrating consistent gains across all popular test sequences for both PSNR and MS-SSIM distortion metrics.
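To make the coarse-to-fine idea behind MS-MCN concrete, here is a minimal numpy sketch of multiscale motion compensation: coarse flow fields are upsampled, scaled and accumulated into finer ones before backward-warping the reference frame. This is an illustrative toy (single-channel frames, nearest-neighbor sampling, hypothetical `warp` and `multiscale_compensate` helpers), not the paper's learned pyramid decoder, which predicts these flow fields with a VAE.

```python
import numpy as np

def warp(frame, flow):
    """Backward-warp a (H, W) frame by a per-pixel displacement field
    flow of shape (H, W, 2), where flow[..., 0] is the x (column)
    displacement and flow[..., 1] the y (row) displacement.
    Nearest-neighbor sampling; out-of-range samples clamp to the border."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    return frame[src_y, src_x]

def multiscale_compensate(ref, flows):
    """Coarse-to-fine compensation. `flows` lists flow fields from the
    coarsest scale up to the full resolution of `ref`; each accumulated
    coarse flow is upsampled 2x (doubling its displacements) and refined
    by the next finer field before the final warp, mirroring a pyramid
    decoder that emits multiscale flow fields."""
    acc = np.zeros_like(flows[0])
    for f in flows:
        if acc.shape[:2] != f.shape[:2]:
            # move to the finer scale: nearest upsampling, displacements x2
            acc = 2.0 * acc.repeat(2, axis=0).repeat(2, axis=1)
        acc = acc + f
    return warp(ref, acc)
```

With all-zero flow fields the accumulated motion is zero and the warped output reproduces the reference exactly; a nonzero coarse flow of one pixel becomes a two-pixel displacement at the next finer scale, which is what lets a shallow pyramid cover a large range of motion magnitudes.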
