Paper Title

Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

Authors

Zhentao Yu, Zixin Yin, Deyu Zhou, Duomin Wang, Finn Wong, Baoyuan Wang

Abstract

In this paper, we introduce a simple and novel framework for one-shot audio-driven talking head generation. Unlike prior works that require additional driving sources for controlled synthesis in a deterministic manner, we instead probabilistically sample all the holistic lip-irrelevant facial motions (i.e., pose, expression, blink, gaze, etc.) to semantically match the input audio while still maintaining both the photo-realism of audio-lip synchronization and the overall naturalness. This is achieved by our newly proposed audio-to-visual diffusion prior trained on top of the mapping between audio and disentangled non-lip facial representations. Thanks to the probabilistic nature of the diffusion prior, one big advantage of our framework is that it can synthesize diverse facial motion sequences given the same audio clip, which is quite user-friendly for many real applications. Through comprehensive evaluations on public benchmarks, we conclude that (1) our diffusion prior significantly outperforms the auto-regressive prior on almost all the concerned metrics; (2) our overall system is competitive with prior works in terms of audio-lip synchronization, but can effectively sample rich and natural-looking lip-irrelevant facial motions that remain semantically harmonized with the audio input.
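To make the abstract's core mechanism concrete, below is a minimal sketch of what an audio-conditioned diffusion prior with ancestral (DDPM-style) sampling looks like: starting from Gaussian noise, a denoiser conditioned on audio features is iteratively applied to produce a lip-irrelevant motion code, and different noise draws from the same audio yield different but audio-consistent motions. All module names, dimensions, and the noise schedule here are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch of a conditional diffusion prior over non-lip facial motion
# codes (pose/expression/blink/gaze). Every constant and class name below is
# a hypothetical placeholder, not taken from the paper's code.
import torch
import torch.nn as nn

T = 50            # number of diffusion steps (assumed)
MOTION_DIM = 64   # dim of the disentangled non-lip motion code (assumed)
AUDIO_DIM = 128   # dim of the audio feature conditioning vector (assumed)

# Linear beta schedule and derived quantities (standard DDPM bookkeeping).
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class AudioConditionedDenoiser(nn.Module):
    """Hypothetical epsilon-predictor: takes the noisy motion code x_t,
    the timestep t, and audio features, and predicts the added noise."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MOTION_DIM + AUDIO_DIM + 1, 256),
            nn.SiLU(),
            nn.Linear(256, MOTION_DIM),
        )

    def forward(self, x_t, t, audio_feat):
        t_embed = t.float().view(-1, 1) / T  # crude timestep embedding
        return self.net(torch.cat([x_t, audio_feat, t_embed], dim=-1))

@torch.no_grad()
def sample_motion(denoiser, audio_feat):
    """Ancestral DDPM sampling: each call draws fresh Gaussian noise, so
    repeated calls with the same audio yield diverse motion codes."""
    x = torch.randn(audio_feat.shape[0], MOTION_DIM)  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((x.shape[0],), t)
        eps = denoiser(x, t_batch, audio_feat)
        # Posterior mean of x_{t-1} given the predicted noise.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # sampled lip-irrelevant motion code

# Usage: two samples from the same audio clip give two distinct motions.
denoiser = AudioConditionedDenoiser()
audio = torch.randn(1, AUDIO_DIM)            # placeholder audio feature
motion_a = sample_motion(denoiser, audio)
motion_b = sample_motion(denoiser, audio)    # differs from motion_a
```

The stochastic draw at each reverse step is what distinguishes this prior from a deterministic (e.g., regression or auto-regressive) audio-to-motion mapping, and is what enables the one-to-many sampling behavior the abstract highlights.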
