Paper Title

Blended Latent Diffusion

Paper Authors

Omri Avrahami, Ohad Fried, Dani Lischinski

Paper Abstract

The tremendous progress in neural image generation, coupled with the emergence of seemingly omnipotent vision-language models, has finally enabled text-based interfaces for creating and editing images. Handling generic images requires a diverse underlying generative model, hence the latest works utilize diffusion models, which were shown to surpass GANs in terms of diversity. One major drawback of diffusion models, however, is their relatively slow inference time. In this paper, we present an accelerated solution to the task of local text-driven editing of generic images, where the desired edits are confined to a user-provided mask. Our solution leverages a recent text-to-image Latent Diffusion Model (LDM), which speeds up diffusion by operating in a lower-dimensional latent space. We first convert the LDM into a local image editor by incorporating Blended Diffusion into it. Next, we propose an optimization-based solution for the inherent inability of this LDM to accurately reconstruct images. Finally, we address the scenario of performing local edits using thin masks. We evaluate our method against the available baselines both qualitatively and quantitatively and demonstrate that in addition to being faster, our method achieves better precision than the baselines while mitigating some of their artifacts.
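As a rough illustration of the blending idea the abstract describes (denoising a latent under text guidance while repeatedly re-injecting the noised background latent outside the mask), the sketch below shows one plausible realization. It is a minimal sketch, not the authors' implementation: the function names (`blended_latent_diffusion`, `encode`), the assumption of an 8x-downsampled latent space, and the diffusers-style `unet`/`scheduler` interfaces are all assumptions introduced for illustration.

```python
# Minimal sketch of per-step latent blending for local text-driven editing.
# Hypothetical names and interfaces; assumes an LDM whose latents are 1/8 the
# image resolution and a scheduler exposing add_noise()/step() in the style
# of the diffusers library.
import torch
import torch.nn.functional as F

def blended_latent_diffusion(unet, scheduler, encode, image, mask, text_emb, steps=50):
    # Encode the source image into the latent space and downsample the
    # user-provided binary mask to the latent resolution.
    source_latent = encode(image)                                    # (1, C, H/8, W/8)
    latent_mask = F.interpolate(mask, size=source_latent.shape[-2:], mode="nearest")

    # Start the edited region from pure noise.
    latent = torch.randn_like(source_latent)
    scheduler.set_timesteps(steps)

    for t in scheduler.timesteps:
        # Text-conditioned denoising step: the "foreground" proposal.
        noise_pred = unet(latent, t, encoder_hidden_states=text_emb).sample
        latent = scheduler.step(noise_pred, t, latent).prev_sample

        # Noise the source latent to the same timestep ("background" at level t)
        # and blend: keep the edit inside the mask, the original outside it.
        noised_source = scheduler.add_noise(
            source_latent, torch.randn_like(source_latent), t
        )
        latent = latent_mask * latent + (1 - latent_mask) * noised_source

    # Decode the final latent with the VAE decoder to obtain the edited image.
    return latent
```

The optimization-based reconstruction and thin-mask handling mentioned in the abstract would sit on top of this loop (e.g., refining the decoded result so the unmasked region matches the input pixels), and are not shown here.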
