ComPhy: Compositional Physical Reasoning of Objects and Events from Videos

Paper Title

ComPhy: Compositional Physical Reasoning of Objects and Events from Videos

Authors

Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan

Abstract

Objects' motions in nature are governed by complex interactions and by the objects' properties. While some properties, such as shape and material, can be identified from an object's visual appearance, others, like mass and electric charge, are not directly visible. The compositionality between the visible and hidden properties poses unique challenges for AI models reasoning about the physical world, whereas humans can effortlessly infer these properties from limited observations. Existing studies on video reasoning mainly focus on visually observable elements such as object appearance, movement, and contact interaction. In this paper, we take an initial step to highlight the importance of inferring hidden physical properties that are not directly observable from visual appearance, by introducing the Compositional Physical Reasoning (ComPhy) dataset. For a given set of objects, ComPhy includes a few videos of them moving and interacting under different initial conditions. A model is evaluated on its ability to unravel the compositional hidden properties, such as mass and charge, and to use this knowledge to answer a set of questions posed about one of the videos. Evaluation results of several state-of-the-art video reasoning models on ComPhy show unsatisfactory performance, as they fail to capture these hidden properties. We further propose an oracle neural-symbolic framework named Compositional Physics Learner (CPL), which combines visual perception, physical property learning, dynamic prediction, and symbolic execution into a unified framework. CPL can effectively identify objects' physical properties from their interactions and predict their dynamics to answer questions.
