Paper Title

Re-Imagen: Retrieval-Augmented Text-to-Image Generator

Authors

Wenhu Chen, Hexiang Hu, Chitwan Saharia, William W. Cohen

Abstract

Research on text-to-image generation has witnessed significant progress in generating diverse and photo-realistic images, driven by diffusion and auto-regressive models trained on large-scale image-text data. Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities, such as "Chortai (dog)" or "Picarones (food)". To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with the knowledge of high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities' visual appearances. We train Re-Imagen on a constructed dataset containing (image, text, retrieval) triples to teach the model to ground on both the text prompt and the retrieval. Furthermore, we develop a new sampling strategy that interleaves classifier-free guidance for the text and retrieval conditions to balance text and retrieval alignment. Re-Imagen achieves significant gains in FID score on COCO and WikiImage. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple object categories including dogs, foods, landmarks, birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen can significantly improve the fidelity of generated images, especially on less frequent entities.
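
The interleaved classifier-free guidance mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `toy_denoiser`, `interleaved_guided_eps`, the guidance weights, and the fixed alternation schedule (`text_every`) are all hypothetical names and values chosen for illustration, and the toy noise predictor merely stands in for a trained diffusion model. The sketch assumes the usual classifier-free guidance form, in which the fully conditioned prediction is extrapolated away from a prediction with one condition dropped, and alternates which condition is dropped from step to step.

```python
import numpy as np


def toy_denoiser(x_t, text_cond, retrieval_cond):
    """Hypothetical stand-in for a diffusion model's noise predictor.

    Passing None for a condition means that condition is dropped.
    """
    eps = 0.1 * x_t
    if text_cond is not None:
        eps = eps + 0.5 * text_cond
    if retrieval_cond is not None:
        eps = eps + 0.25 * retrieval_cond
    return eps


def interleaved_guided_eps(x_t, text_cond, retrieval_cond, step,
                           w_text=3.0, w_retrieval=2.0, text_every=2):
    """Return a guided noise estimate for one sampling step.

    Assumption: steps where step % text_every == 0 apply guidance toward
    the text condition; all other steps apply guidance toward the
    retrieved neighbors. The real schedule and weights may differ.
    """
    full = toy_denoiser(x_t, text_cond, retrieval_cond)
    if step % text_every == 0:
        # Text guidance: contrast against a prediction without the text.
        dropped = toy_denoiser(x_t, None, retrieval_cond)
        w = w_text
    else:
        # Retrieval guidance: contrast against a prediction without neighbors.
        dropped = toy_denoiser(x_t, text_cond, None)
        w = w_retrieval
    # Standard classifier-free guidance extrapolation.
    return w * full - (w - 1.0) * dropped


x = np.ones(4)
text = np.ones(4)        # placeholder text embedding
neighbors = np.ones(4)   # placeholder retrieval embedding
eps_text_step = interleaved_guided_eps(x, text, neighbors, step=0)
eps_retrieval_step = interleaved_guided_eps(x, text, neighbors, step=1)
```

Alternating the two guidance terms, rather than applying both at every step, is one way to trade off alignment with the prompt against alignment with the retrieved references; changing `text_every` in this sketch shifts that balance.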
