Paper Title

TediGAN: Text-Guided Diverse Face Image Generation and Manipulation

Paper Authors

Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu

Paper Abstract

In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module maps real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity learns the text-image matching by mapping the image and text into a common embedding space. The instance-level optimization is for identity preservation in manipulation. Our model can produce diverse and high-quality images with an unprecedented resolution at 1024. Using a control mechanism based on style-mixing, our TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels, with or without instance guidance. To facilitate text-guided multi-modal synthesis, we propose the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation map, sketch, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
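The visual-linguistic similarity component is described only at a high level in the abstract. The following is a minimal, illustrative sketch (not the authors' implementation) of the general idea: an image encoder and a text encoder map their inputs into a common embedding space, and a contrastive matching loss pulls matched image-text pairs together. All names and dimensions here (ImageEncoder, TextEncoder, embed_dim=512, matching_loss) are assumptions for illustration.

# Illustrative sketch only; the encoders are hypothetical stand-ins,
# not the architecture used in TediGAN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    # Hypothetical image encoder producing an L2-normalized embedding.
    def __init__(self, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
    def forward(self, images):
        return F.normalize(self.net(images), dim=-1)

class TextEncoder(nn.Module):
    # Hypothetical text encoder (bag-of-tokens) mapped into the same space.
    def __init__(self, vocab_size=10000, embed_dim=512):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, embed_dim)
    def forward(self, token_ids):
        return F.normalize(self.fc(self.emb(token_ids)), dim=-1)

def matching_loss(img_emb, txt_emb, temperature=0.07):
    # Contrastive text-image matching in the shared embedding space:
    # matched pairs (the diagonal of the similarity matrix) should score
    # higher than mismatched ones, in both image-to-text and
    # text-to-image directions.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2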
