Paper Title
OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs
Paper Authors
Paper Abstract
Text-to-image generation aims to automatically produce photo-realistic images conditioned on textual descriptions. It can potentially be employed in fields such as art creation, data augmentation, and photo editing. Although many efforts have been dedicated to this task, it remains particularly challenging to generate believable, natural scenes. To facilitate real-world applications of text-to-image synthesis, we focus on studying the following three issues: 1) How to ensure that generated samples are believable, realistic, or natural? 2) How to exploit the latent space of the generator to edit a synthesized image? 3) How to improve the explainability of a text-to-image generation framework? In this work, we construct two novel data sets (i.e., the Good & Bad bird and face data sets) consisting of successful as well as unsuccessful generated samples, selected according to strict criteria. To effectively and efficiently acquire high-quality images by increasing the probability of generating Good latent codes, we use a dedicated Good/Bad classifier for generated images; it is built on a pre-trained front end and fine-tuned on the proposed Good & Bad data set. We then present a novel algorithm that identifies semantically understandable directions in the latent space of a conditional text-to-image GAN architecture by performing independent component analysis (ICA) on the pre-trained weights of the generator. Furthermore, we develop a background-flattening loss (BFL) to improve the background appearance of edited images. Finally, we introduce a linear interpolation analysis between pairs of keywords and extend it to a triangular `linguistic' interpolation, in order to gain deeper insight into what a text-to-image synthesis model has learned within its linguistic embeddings. Our data sets are available at https://zenodo.org/record/6283798#.YhkN_ujMI2w.
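As a rough illustration of the Good/Bad classifier described above, the sketch below fine-tunes a binary head on top of a frozen pre-trained backbone. The choice of ResNet-50, the frozen-backbone setup, and all hyperparameters are assumptions for illustration only; the paper's actual front end and training configuration may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumed setup: a ResNet-50 stands in for the pre-trained front end
# (requires torchvision >= 0.13 for the weights enum).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():
    p.requires_grad = False  # freeze the pre-trained front end
model.fc = nn.Linear(model.fc.in_features, 2)  # new head: Good vs. Bad logits

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)

def training_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a Good & Bad mini-batch (labels: 0 = Bad, 1 = Good)."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```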
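The direction-discovery step can be pictured with a minimal sketch: ICA is applied to a pre-trained weight matrix of the generator, and the resulting independent components serve as candidate edit directions in the latent space. The weight file, its dimensions, the number of components, and the edit strength below are placeholders, not the paper's actual configuration.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Placeholder for a pre-trained generator weight matrix, e.g. a first
# fully connected layer mapping a 128-d latent code to 1024 hidden units.
W = np.load("generator_fc1_weights.npy")  # hypothetical file, shape (1024, 128)

# Each row of W is treated as one observation over the latent dimensions.
ica = FastICA(n_components=10, random_state=0)
ica.fit(W)
directions = ica.components_  # shape (10, 128): candidate semantic directions
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Latent-space editing: move a latent code along a discovered direction.
z = np.random.randn(128)
z_edited = z + 3.0 * directions[0]  # edit strength 3.0 chosen arbitrarily
```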
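Finally, the keyword interpolation analyses reduce to convex combinations of text-encoder embeddings. The sketch below assumes word-level embedding vectors are available from the model's text encoder; the function names, the 256-d embedding size, and the example keywords are illustrative.

```python
import numpy as np

def lerp(e_a: np.ndarray, e_b: np.ndarray, t: float) -> np.ndarray:
    """Linear interpolation between the embeddings of two keywords (0 <= t <= 1)."""
    return (1.0 - t) * e_a + t * e_b

def tri_interp(e_a: np.ndarray, e_b: np.ndarray, e_c: np.ndarray,
               u: float, v: float) -> np.ndarray:
    """Triangular (barycentric) interpolation over three keyword embeddings.

    Requires u >= 0, v >= 0, u + v <= 1; the corner weights are
    (1 - u - v, u, v).
    """
    return (1.0 - u - v) * e_a + u * e_b + v * e_c

# Example: sweep between two keyword embeddings; each interpolated vector
# would be fed to the conditional generator in place of a real keyword.
e_red, e_yellow = np.random.randn(256), np.random.randn(256)  # stand-in vectors
sweep = [lerp(e_red, e_yellow, t) for t in np.linspace(0.0, 1.0, 5)]
```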