Continuously Controllable Facial Expression Editing in Talking Face Videos

Paper Title

Continuously Controllable Facial Expression Editing in Talking Face Videos

Authors

Sun, Zhiyao, Wen, Yu-Hui, Lv, Tian, Sun, Yanan, Zhang, Ziyang, Wang, Yaoyuan, Liu, Yong-Jin

Abstract

Recently, audio-driven talking face video generation has attracted considerable attention. However, very few studies address emotional editing of these talking face videos with continuously controllable expressions, for which there is strong demand in industry. The challenge is that speech-related expressions and emotion-related expressions are often highly coupled. Meanwhile, traditional image-to-image translation methods do not work well in this application because expressions are coupled with other attributes such as pose; that is, translating the expression of the character in each frame may simultaneously change the head pose due to bias in the training data distribution. In this paper, we propose a high-quality facial expression editing method for talking face videos that allows the user to continuously control the target emotion in the edited video. We present a new perspective on this task as a special case of motion information editing, in which we use a 3DMM to capture major facial movements and an associated texture map modeled by a StyleGAN to capture appearance details. Both representations (3DMM and texture map) contain emotional information, can be continuously modified by neural networks, and can be easily smoothed by averaging in coefficient/latent spaces, making our method simple yet effective. We also introduce a mouth shape preservation loss to control the trade-off between lip synchronization and the degree of exaggeration of the edited expression. Extensive experiments and a user study show that our method achieves state-of-the-art performance across various evaluation criteria.
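The abstract states that both representations can be smoothed by averaging in coefficient/latent spaces but does not describe the exact scheme. As a rough illustration only, the sketch below applies a centered moving average to a sequence of per-frame coefficient vectors. The function name `smooth_coefficients`, the list-of-lists layout, and the default window of 5 frames are all assumptions made for this example, not details taken from the paper.

```python
from typing import List


def smooth_coefficients(frames: List[List[float]], window: int = 5) -> List[List[float]]:
    """Centered moving average over per-frame coefficient vectors.

    frames: one coefficient vector per video frame (hypothetical layout).
    window: odd number of frames to average; the window is truncated
            at the start and end of the clip.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for t in range(len(frames)):
        # Clamp the averaging window to the valid frame range.
        lo = max(0, t - half)
        hi = min(len(frames), t + half + 1)
        neighbours = frames[lo:hi]
        dim = len(frames[t])
        # Average each coefficient dimension independently.
        smoothed.append(
            [sum(f[d] for f in neighbours) / len(neighbours) for d in range(dim)]
        )
    return smoothed
```

A larger window would presumably suppress more frame-to-frame jitter at the cost of blurring fast motion such as lip movement, so the window size here should be read as a tunable placeholder.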
