Paper Title


Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Paper Authors

Ran Yi, Zipeng Ye, Juyong Zhang, Hujun Bao, Yong-Jin Liu

Paper Abstract


Real-world talking faces are often accompanied by natural head movements. However, most existing talking face video generation methods only consider facial animation with a fixed head pose. In this paper, we address this problem by proposing a deep neural network model that takes an audio signal A of a source person and a very short video V of a target person as input, and outputs a synthesized high-quality talking face video with personalized head pose (making use of the visual information in V), expression, and lip synchronization (by considering both A and V). The most challenging issue in our work is that natural poses often cause in-plane and out-of-plane head rotations, which make a synthesized talking face video far from realistic. To address this challenge, we reconstruct a 3D face animation and re-render it into synthesized frames. To refine these frames into realistic ones with smooth background transitions, we propose a novel memory-augmented GAN module. By first training a general mapping on a publicly available dataset and then fine-tuning the mapping using the input short video of the target person, we develop an effective strategy that requires only a small number of frames (about 300) to learn personalized talking behavior, including head pose. Extensive experiments and two user studies show that our method can generate high-quality talking face videos (i.e., with personalized head movements, expressions, and good lip synchronization) that look natural and exhibit more distinctive head movement effects than those of state-of-the-art methods.
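The abstract's pipeline can be summarized as three stages: map A and V to per-frame pose/expression parameters, reconstruct and re-render a 3D face per frame, then refine the raw renders with the memory-augmented GAN. The following is a minimal structural sketch of that flow; every function name and data representation here is a hypothetical stand-in, not the paper's actual implementation.

```python
# Hedged sketch of the three-stage pipeline described in the abstract.
# All names (extract_pose_expression, render_3d_face, refine_with_memory_gan)
# are hypothetical placeholders for the paper's learned components.

def extract_pose_expression(audio_a, video_v):
    """Stage 1: map audio A (and the short target video V) to per-frame
    head-pose and expression parameters. Personalized head pose comes from
    the visual information in V; expression/lip sync uses both A and V."""
    return [{"pose": (0.0, 0.0, 0.0), "expr": [0.0]} for _ in audio_a]

def render_3d_face(params):
    """Stage 2: reconstruct a 3D face animation from the parameters and
    re-render it into raw synthesized frames (dummy frame tokens here)."""
    return ["raw_frame_%d" % i for i, _ in enumerate(params)]

def refine_with_memory_gan(raw_frames):
    """Stage 3: refine raw renders into realistic frames with smooth
    background transitions (stand-in for the memory-augmented GAN)."""
    return [f.replace("raw", "refined") for f in raw_frames]

def generate_talking_face(audio_a, video_v):
    params = extract_pose_expression(audio_a, video_v)
    raw = render_3d_face(params)
    return refine_with_memory_gan(raw)

frames = generate_talking_face(audio_a=[0.1] * 4, video_v=["v0", "v1"])
print(frames[0])  # refined_frame_0
```

The two-phase training strategy (general mapping pre-trained on a public dataset, then fine-tuned on ~300 frames of the target person) would sit on top of stage 1 and stage 3; it is omitted here since the abstract gives no further detail.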
