Video2StyleGAN: Encoding Video in Latent Space for Manipulation

Paper Title

Video2StyleGAN: Encoding Video in Latent Space for Manipulation

Authors

Jiyang Yu, Jingen Liu, Jing Huang, Wei Zhang, Tao Mei

Abstract

Many recent works have been proposed for face image editing by leveraging the latent space of pretrained GANs. However, few attempts have been made to directly apply them to videos, because 1) they do not guarantee temporal consistency, 2) their application is limited by their processing speed on videos, and 3) they cannot accurately encode details of face motion and expression. To this end, we propose a novel network to encode face videos into the latent space of StyleGAN for semantic face video manipulation. Based on the vision transformer, our network reuses the high-resolution portion of the latent vector to enforce temporal consistency. To capture subtle face motions and expressions, we design novel losses that involve sparse facial landmarks and dense 3D face mesh. We have thoroughly evaluated our approach and successfully demonstrated its application to various face video manipulations. Particularly, we propose a novel network for pose/expression control in a 3D coordinate system. Both qualitative and quantitative results have shown that our approach can significantly outperform existing single image methods, while achieving real-time (66 fps) speed.
