Paper Title
Audio-Driven Talking Face Video Generation with Dynamic Convolution Kernels
Paper Authors
Paper Abstract
In this paper, we present a dynamic convolution kernel (DCK) strategy for convolutional neural networks. Using a fully convolutional network with the proposed DCKs, high-quality talking-face video can be generated from multi-modal sources (i.e., unmatched audio and video) in real time, and our trained model is robust to different identities, head postures, and input audio. Our proposed DCKs are specially designed for audio-driven talking-face video generation, leading to a simple yet effective end-to-end system. We also provide a theoretical analysis to interpret why DCKs work. Experimental results show that our method can generate high-quality talking-face video, including the background, at 60 fps. Comparisons and evaluations against state-of-the-art methods demonstrate the superiority of our approach.
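To illustrate the core idea of a dynamic convolution kernel, the sketch below shows one way a convolution layer can draw its kernel weights from an audio feature vector rather than from fixed learned parameters, so the audio signal directly modulates the image branch. This is a minimal, hypothetical PyTorch sketch; the class and parameter names (`DynamicConv2d`, `audio_dim`, `kernel_gen`) are illustrative and not taken from the authors' implementation.

```python
# Minimal sketch of an audio-conditioned dynamic convolution layer (PyTorch).
# Assumption: kernel weights are predicted per sample by a small MLP from an
# audio feature vector; this is NOT the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Conv layer whose kernel weights are generated from an audio feature."""
    def __init__(self, audio_dim, in_ch, out_ch, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        # MLP mapping an audio feature vector to a full set of kernel weights.
        self.kernel_gen = nn.Linear(audio_dim, out_ch * in_ch * k * k)

    def forward(self, x, audio_feat):
        # x: (B, in_ch, H, W) image features; audio_feat: (B, audio_dim)
        B = x.size(0)
        w = self.kernel_gen(audio_feat).view(
            B * self.out_ch, self.in_ch, self.k, self.k)
        # Grouped-conv trick: fold the batch into channels so each sample
        # is convolved with its own audio-conditioned kernel.
        x = x.view(1, B * self.in_ch, *x.shape[2:])
        y = F.conv2d(x, w, padding=self.k // 2, groups=B)
        return y.view(B, self.out_ch, *y.shape[2:])

# Usage: per-frame audio features drive per-frame kernels.
layer = DynamicConv2d(audio_dim=128, in_ch=64, out_ch=64)
frames = torch.randn(4, 64, 32, 32)   # image feature maps
audio = torch.randn(4, 128)           # per-frame audio features
out = layer(frames, audio)            # -> (4, 64, 32, 32)
```

Because the kernel generator replaces only the weights of standard convolutions, the rest of the network stays fully convolutional, which is consistent with the real-time generation the abstract claims.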