Paper Title
Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction
Paper Authors
Paper Abstract
Modeling hand-object manipulations is essential for understanding how humans interact with their environment. While of practical importance, estimating the pose of hands and objects during interactions is challenging due to the large mutual occlusions that occur during manipulation. Recent efforts have been directed towards fully-supervised methods that require large amounts of labeled training samples. Collecting 3D ground-truth data for hand-object interactions, however, is costly, tedious, and error-prone. To overcome this challenge, we present a method to leverage photometric consistency across time when annotations are only available for a sparse subset of frames in a video. Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses. Given our estimated reconstructions, we differentiably render the optical flow between pairs of adjacent images and use it within the network to warp one frame to another. We then apply a self-supervised photometric loss that relies on the visual consistency between nearby images. We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach allows us to improve the pose estimation accuracy by leveraging information from neighboring frames in low-data regimes.
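The core self-supervision described in the abstract is a photometric consistency loss: the optical flow rendered from the estimated hand-object reconstructions is used to warp one frame onto a neighboring frame, and the color difference between the warped and target images is penalized. The sketch below illustrates this idea in PyTorch; it is a minimal, hypothetical reconstruction based only on the abstract, not the authors' released code. In the paper the flow comes from differentiably rendering the predicted meshes, whereas here it is simply an input tensor, and the function name, argument names, and masking scheme are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def photometric_loss(frame_t, frame_t1, flow_t_to_t1, mask=None):
    """Self-supervised photometric loss (sketch): warp frame t+1 back to
    frame t using the rendered optical flow and penalize color differences.

    frame_t, frame_t1: (B, 3, H, W) color images.
    flow_t_to_t1:      (B, 2, H, W) flow in pixels, mapping frame t to t+1
                       (assumed here to come from the differentiable renderer).
    mask:              optional (B, 1, H, W) validity mask, e.g. the rendered
                       hand/object silhouette where the flow is defined.
    """
    B, _, H, W = frame_t.shape

    # Base pixel grid, stacked in (x, y) order to match the flow layout.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=frame_t.dtype, device=frame_t.device),
        torch.arange(W, dtype=frame_t.dtype, device=frame_t.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0)         # (1, 2, H, W)

    # Displace each pixel by the flow, then normalize to [-1, 1]
    # as required by grid_sample.
    coords = grid + flow_t_to_t1                             # (B, 2, H, W)
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)

    # Differentiable warp of frame t+1 into frame t's view; gradients flow
    # back through the sampling coordinates, and hence through the flow.
    warped = F.grid_sample(frame_t1, sample_grid, align_corners=True)

    diff = (warped - frame_t).abs()
    if mask is not None:
        diff = diff * mask
        return diff.sum() / mask.sum().clamp(min=1.0)
    return diff.mean()
```

Because the warp is differentiable, this loss provides a training signal on unannotated frames: any pose error in the reconstruction distorts the rendered flow, which misaligns the warped image and increases the photometric error, so gradients push the predicted hand and object poses toward visual consistency with the neighboring frame.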