Paper Title
3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations
Paper Authors
Paper Abstract
We propose a system that learns to detect objects and infer their 3D poses in RGB-D images. Many existing systems can identify objects and infer 3D poses, but they rely heavily on human labels and 3D annotations. The challenge here is to achieve this without relying on strong supervision signals. To address this challenge, we propose a model that maps RGB-D images to a set of 3D visual feature maps in a differentiable, fully convolutional manner, supervised by predicting views. The 3D feature maps correspond to a featurization of the 3D world scene depicted in the images. The object 3D feature representations are invariant to camera viewpoint changes or zooms, which means feature matching can identify similar objects under different camera viewpoints. We can compare the 3D feature maps of two objects by searching for an alignment across scales and 3D rotations, and, as a byproduct of this search, we can estimate pose and scale changes without the need for 3D pose annotations. We cluster object feature maps into a set of 3D prototypes that represent familiar objects in canonical scales and orientations. We then parse images by inferring the prototype identity and 3D pose for each detected object. We compare our method to numerous baselines that do not learn 3D visual feature representations or do not attempt to correspond features across scenes, and we outperform them by a large margin in the tasks of object retrieval and object pose estimation. Thanks to the 3D nature of the object-centric feature maps, the visual similarity cues are invariant to 3D pose changes or small scale changes, which gives our method an advantage over 2D and 1D methods.
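The corresponding-and-quantizing idea in the abstract can be made concrete with a short sketch. The following is a minimal NumPy/SciPy illustration, not the authors' implementation: it brute-force searches yaw-only rotations and isotropic scales to align two 3D feature volumes by cosine similarity, then assigns an object to its best-matching 3D prototype. All names (`align`, `quantize`, `center_fit`), the (C, D, H, W) layout, the grid sizes, and the candidate angle/scale sets are illustrative assumptions; the paper's learned features and full search procedure differ.

```python
import numpy as np
from scipy.ndimage import rotate, zoom


def cosine_sim(a, b):
    """Cosine similarity between two flattened feature volumes."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def center_fit(vol, shape):
    """Center-crop or zero-pad `vol` so it matches `shape`."""
    out = np.zeros(shape, dtype=vol.dtype)
    src, dst = [], []
    for have, want in zip(vol.shape, shape):
        if have >= want:  # crop
            lo = (have - want) // 2
            src.append(slice(lo, lo + want))
            dst.append(slice(0, want))
        else:             # pad
            lo = (want - have) // 2
            src.append(slice(0, have))
            dst.append(slice(lo, lo + have))
    out[tuple(dst)] = vol[tuple(src)]
    return out


def align(query, reference, angles=range(0, 360, 30), scales=(0.75, 1.0, 1.25)):
    """Brute-force search for the yaw rotation and isotropic scale that best
    align `query` to `reference`; both are (C, D, H, W) feature volumes.
    Returns (best similarity, best angle in degrees, best scale)."""
    best = (-np.inf, None, None)
    for angle in angles:
        # Rotate every channel about the vertical axis (the D-W plane).
        rot = rotate(query, angle, axes=(1, 3), reshape=False, order=1)
        for s in scales:
            cand = center_fit(zoom(rot, (1, s, s, s), order=1), reference.shape)
            score = cosine_sim(cand, reference)
            if score > best[0]:
                best = (score, angle, s)
    return best


def quantize(feature, prototypes, **search):
    """Assign `feature` to its best-matching prototype, returning the
    prototype index plus the rotation and scale that aligned it, i.e. the
    "identity + 3D pose" parse the abstract describes."""
    results = [align(feature, p, **search) for p in prototypes]
    idx = int(np.argmax([r[0] for r in results]))
    return idx, results[idx][1], results[idx][2]


# Toy usage: random volumes stand in for learned features. An object made by
# rotating prototype 1 by 60 degrees should match back to prototype 1, with
# the inverse rotation (300 degrees) recovered by the search.
rng = np.random.default_rng(0)
protos = [rng.standard_normal((8, 16, 16, 16)).astype(np.float32)
          for _ in range(3)]
obj = rotate(protos[1], 60, axes=(1, 3), reshape=False, order=1)
print(quantize(obj, protos))  # expected: (1, 300, 1.0)
```

Restricting the search to yaw rotations keeps the candidate set small; a full SO(3) search or a coarse-to-fine refinement would follow the same pattern, and because the matching happens in a metric 3D feature space, the recovered rotation and scale double as a pose estimate without any pose labels.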