Paper Title
Halluci-Net: Scene Completion by Exploiting Object Co-occurrence Relationships
Paper Authors
Paper Abstract
Image synthesis from semantic labelmaps has progressed substantially in recent years. However, existing methods assume the availability of complete and unambiguous labelmaps, with instance boundaries of objects and a class label for every pixel. This reliance on heavily annotated inputs restricts the use of image synthesis techniques in real-world applications, especially under uncertainty due to weather, occlusion, or noise. On the other hand, algorithms that can synthesize images from sparse labelmaps or sketches are highly desirable as tools that let content creators and artists quickly generate scenes by simply specifying the locations of a few objects. In this paper, we address the problem of complex scene completion from sparse labelmaps. Under this setting, very few details about the scene (30% of object instances) are available as input for image synthesis. We propose a two-stage deep-network-based method, called 'Halluci-Net', that learns co-occurrence relationships between objects in scenes and then exploits these relationships to produce a dense and complete labelmap. The generated dense labelmap can then be used as input by state-of-the-art image synthesis techniques such as pix2pixHD to obtain the final image. The proposed method is evaluated on the Cityscapes dataset, where it outperforms two baseline methods on performance metrics such as Fréchet Inception Distance (FID), semantic segmentation accuracy, and similarity in object co-occurrences. We also show qualitative results on a subset of the ADE20K dataset that contains bedroom images.
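The abstract relies on two notions: a sparse labelmap that retains only about 30% of object instances, and object co-occurrence statistics. The sketch below illustrates one plausible reading of both using NumPy. It is not the authors' implementation: the function names, the random choice of which instances to keep, the unlabeled id of 255, and the scene-level definition of co-occurrence are all assumptions made for illustration.

```python
import numpy as np


def sparsify_labelmap(class_map, instance_map, keep_fraction=0.3,
                      unlabeled=255, seed=0):
    """Keep a random fraction of object instances; mark other pixels unlabeled.

    Hypothetical helper: the abstract only states that 30% of object
    instances are available, not how they are selected.
    """
    rng = np.random.default_rng(seed)
    instance_ids = np.unique(instance_map)
    n_keep = max(1, int(round(keep_fraction * len(instance_ids))))
    kept = rng.choice(instance_ids, size=n_keep, replace=False)
    mask = np.isin(instance_map, kept)
    return np.where(mask, class_map, unlabeled)


def cooccurrence_matrix(class_maps, num_classes):
    """Count, for each class pair, the scenes in which both classes appear.

    Assumed definition: two classes co-occur if both are present anywhere
    in the same labelmap. Ids >= num_classes (e.g. unlabeled) are ignored.
    """
    counts = np.zeros((num_classes, num_classes), dtype=np.int64)
    for class_map in class_maps:
        present = np.zeros(num_classes, dtype=np.int64)
        labels = np.unique(class_map)
        present[labels[labels < num_classes]] = 1
        counts += np.outer(present, present)
    return counts


# Toy scene: 10 one-pixel instances spread over 3 classes.
instance_map = np.arange(10).reshape(2, 5)
class_map = instance_map % 3
sparse = sparsify_labelmap(class_map, instance_map)

# Two toy labelmaps for the co-occurrence count.
scenes = [np.array([[0, 1], [1, 255]]), np.array([[1, 2], [2, 2]])]
counts = cooccurrence_matrix(scenes, num_classes=3)
```

Under these assumptions, a completed labelmap could be compared with ground truth by computing such a matrix for each set of maps and measuring how close the two matrices are. The abstract does not specify the similarity measure the paper actually uses.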