Paper Title
Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding
Paper Authors
Paper Abstract
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. We address this challenge by introducing Hypersim, a photorealistic synthetic dataset for holistic indoor scene understanding. To create our dataset, we leverage a large repository of synthetic scenes created by professional artists, and we generate 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry. Our dataset: (1) relies exclusively on publicly available 3D assets; (2) includes complete scene geometry, material information, and lighting information for every scene; (3) includes dense per-pixel semantic instance segmentations and complete camera information for every image; and (4) factors every image into diffuse reflectance, diffuse illumination, and a non-diffuse residual term that captures view-dependent lighting effects. We analyze our dataset at the level of scenes, objects, and pixels, and we analyze costs in terms of money, computation time, and annotation effort. Remarkably, we find that it is possible to generate our entire dataset from scratch, for roughly half the cost of training a popular open-source natural language processing model. We also evaluate sim-to-real transfer performance on two real-world scene understanding tasks - semantic segmentation and 3D shape prediction - where we find that pre-training on our dataset significantly improves performance on both tasks, and achieves state-of-the-art performance on the most challenging Pix3D test set. All of our rendered image data, as well as all the code we used to generate our dataset and perform our experiments, is available online.
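The per-image factorization described in point (4) can be illustrated with a minimal sketch: the final image is recombined as diffuse reflectance multiplied element-wise by diffuse illumination, plus the non-diffuse residual. The file names, HDF5 dataset key, and array shapes below are hypothetical placeholders, not the dataset's documented layout.

    # Minimal sketch (assumptions noted above): recombining the three
    # per-image factors into a single HDR image.
    import h5py
    import numpy as np

    def reconstruct_image(reflectance_path, illumination_path, residual_path):
        """Reconstruct an image as reflectance * illumination + residual."""
        def load(path):
            # "dataset" is an assumed HDF5 key; adjust to the actual file layout.
            with h5py.File(path, "r") as f:
                return np.array(f["dataset"], dtype=np.float32)  # assumed HxWx3

        reflectance = load(reflectance_path)    # diffuse reflectance
        illumination = load(illumination_path)  # diffuse illumination
        residual = load(residual_path)          # view-dependent, non-diffuse term

        # Element-wise product captures diffuse shading; the residual adds back
        # view-dependent lighting effects such as specular highlights.
        return reflectance * illumination + residual

Because the residual term is stored explicitly rather than discarded, this recombination recovers the full rendered image, while the individual factors remain available for tasks such as intrinsic image decomposition.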