Paper Title
Learning Physical Graph Representations from Visual Scenes
Paper Authors
Abstract
Convolutional Neural Networks (CNNs) have proved exceptional at learning representations for visual object categorization. However, CNNs do not explicitly encode objects, parts, and their physical properties, which has limited CNNs' success on tasks that require structured understanding of visual scenes. To overcome these limitations, we introduce the idea of Physical Scene Graphs (PSGs), which represent scenes as hierarchical graphs, with nodes in the hierarchy corresponding intuitively to object parts at different scales, and edges to physical connections between parts. Bound to each node is a vector of latent attributes that intuitively represent object properties such as surface shape and texture. We also describe PSGNet, a network architecture that learns to extract PSGs by reconstructing scenes through a PSG-structured bottleneck. PSGNet augments standard CNNs by including: recurrent feedback connections to combine low- and high-level image information; graph pooling and vectorization operations that convert spatially uniform feature maps into object-centric graph structures; and perceptual grouping principles to encourage the identification of meaningful scene elements. We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks, especially on complex real-world images, and generalizes well to unseen object types and scene arrangements. PSGNet is also able to learn from physical motion, enhancing scene estimates even for static images. We present a series of ablation studies illustrating the importance of each component of the PSGNet architecture, analyses showing that learned latent attributes capture intuitive scene properties, and illustrations of the use of PSGs for compositional scene inference.
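The "vectorization" operation mentioned above — converting a spatially uniform feature map into per-node attribute vectors — can be sketched minimally as segment-wise pooling. This is a simplified illustration, not the authors' implementation: the function name `vectorize_feature_map` and the use of mean pooling over a precomputed segment-label map are assumptions for exposition.

```python
import numpy as np

def vectorize_feature_map(features, labels):
    """Pool a spatially uniform feature map into per-segment node vectors.

    features: (H, W, C) array of CNN features.
    labels:   (H, W) integer segment ids (a partition of the image).
    Returns (ids, nodes): segment ids and an (N, C) array with one
    attribute vector per segment, i.e. one graph node per scene element.
    """
    flat_feats = features.reshape(-1, features.shape[-1])
    flat_labels = labels.reshape(-1)
    ids = np.unique(flat_labels)
    # mean-pool the features belonging to each segment (hypothetical choice;
    # any permutation-invariant aggregation could play this role)
    nodes = np.stack([flat_feats[flat_labels == i].mean(axis=0) for i in ids])
    return ids, nodes

# Toy example: a 4x4 feature map partitioned into two segments.
feats = np.zeros((4, 4, 3))
feats[:, :2] = [1.0, 0.0, 0.0]   # left half of the image
feats[:, 2:] = [0.0, 1.0, 0.0]   # right half of the image
labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1
ids, nodes = vectorize_feature_map(feats, labels)
# nodes is a (2, 3) array: one latent attribute vector per segment
```

In the full architecture, such node vectors would additionally carry learned latent attributes (shape, texture, position) and be linked by edges from a graph-pooling stage; this sketch only shows the map-to-nodes conversion.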