Paper Title
StyleFaceV: Face Video Generation via Decomposing and Recomposing Pretrained StyleGAN3
Authors
Abstract
Realistic generative face video synthesis has long been a pursuit in both the computer vision and graphics communities. However, existing face video generation methods tend to produce low-quality frames with drifting facial identities and unnatural movements. To tackle these challenges, we propose a principled framework named StyleFaceV, which produces high-fidelity, identity-preserving face videos with vivid movements. Our core insight is to decompose appearance and pose information and recompose them in the latent space of StyleGAN3 to produce stable yet dynamic results. Specifically, StyleGAN3 provides strong priors for high-fidelity facial image generation, but its latent space is intrinsically entangled. By carefully examining its latent properties, we propose decomposition and recomposition designs that allow for the disentangled combination of facial appearance and movements. Moreover, a temporally-dependent model is built upon the decomposed latent features and samples plausible motion sequences, enabling the generation of realistic and temporally coherent face videos. Notably, our pipeline is trained with a joint training strategy on both static images and high-quality video data, which offers higher data efficiency. Extensive experiments demonstrate that our framework achieves state-of-the-art face video generation results both qualitatively and quantitatively. In particular, StyleFaceV is capable of generating realistic $1024\times1024$ face videos even without high-resolution training videos.
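The decompose-recompose pipeline described in the abstract can be illustrated with a minimal toy sketch: a latent code is split into an appearance code and a pose code, a temporal model samples a pose sequence, and each pose is recombined with the fixed appearance code to yield per-frame latents for the synthesis network. All dimensions, module names, and the linear-projection stand-ins below are assumptions for illustration; the actual StyleFaceV modules are learned networks and are not specified in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: a StyleGAN3-style 512-dim latent, a 64-dim pose code, 8 frames.
LATENT_DIM, POSE_DIM, NUM_FRAMES = 512, 64, 8

# Toy fixed linear maps standing in for the learned decomposition/recomposition
# modules (hypothetical; the real ones are trained neural networks).
W_app = rng.standard_normal((LATENT_DIM, LATENT_DIM - POSE_DIM)) / np.sqrt(LATENT_DIM)
W_pose = rng.standard_normal((LATENT_DIM, POSE_DIM)) / np.sqrt(LATENT_DIM)
W_rec = rng.standard_normal((LATENT_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)

def decompose(w):
    """Split one latent into an (appearance, pose) pair."""
    return w @ W_app, w @ W_pose

def recompose(app, pose):
    """Fuse an appearance code with a pose code back into a full latent."""
    return np.concatenate([app, pose], axis=-1) @ W_rec

def sample_motion(pose0, num_frames):
    """Toy autoregressive drift standing in for the temporal motion model."""
    poses = [pose0]
    for _ in range(num_frames - 1):
        poses.append(poses[-1] + 0.1 * rng.standard_normal(POSE_DIM))
    return poses

# Generate per-frame latents: appearance is held fixed, only pose varies.
w = rng.standard_normal(LATENT_DIM)
app, pose = decompose(w)
frame_latents = [recompose(app, p) for p in sample_motion(pose, NUM_FRAMES)]
print(len(frame_latents), frame_latents[0].shape)
```

Each entry of `frame_latents` would then be fed to StyleGAN3's synthesis network to render one video frame; keeping `app` fixed across frames is what preserves identity while the sampled pose sequence drives the motion.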