Paper Title
Ego-Body Pose Estimation via Ego-Head Pose Estimation
Paper Authors
Paper Abstract
Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR. However, naively learning a mapping between egocentric videos and human motions is challenging, because the user's body is often unobserved by the front-facing camera placed on the head of the user. In addition, collecting large-scale, high-quality datasets with paired egocentric videos and 3D human motions requires accurate motion capture devices, which often limit the variety of scenes in the videos to lab-like environments. To eliminate the need for paired egocentric video and human motions, we propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages, connected by the head motion as an intermediate representation. EgoEgo first integrates SLAM and a learning approach to estimate accurate head motion. Subsequently, leveraging the estimated head pose as input, EgoEgo utilizes conditional diffusion to generate multiple plausible full-body motions. This disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motion, enabling us to leverage large-scale egocentric video datasets and motion capture datasets separately. Moreover, for systematic benchmarking, we develop a synthetic dataset, AMASS-Replica-Ego-Syn (ARES), with paired egocentric videos and human motion. On both ARES and real data, our EgoEgo model performs significantly better than the current state-of-the-art methods.
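The abstract describes a two-stage decomposition: first estimate ego-head pose from the egocentric video, then sample plausible full-body motions conditioned on that head trajectory with a diffusion model. The sketch below only illustrates this pipeline structure in Python; all function names, tensor shapes, and the joint count are hypothetical placeholders under assumed conventions, not the authors' released code or API.

```python
import numpy as np


def estimate_head_motion(video_frames: np.ndarray) -> np.ndarray:
    """Stage 1 (placeholder): estimate per-frame head poses from egocentric video.

    In the paper this stage combines monocular SLAM with a learned model; here we
    simply return an identity trajectory of shape (T, 4, 4) as a stand-in.
    """
    T = len(video_frames)
    return np.tile(np.eye(4), (T, 1, 1))


def sample_body_motion(head_poses: np.ndarray, num_samples: int = 3) -> list:
    """Stage 2 (placeholder): sample several plausible full-body motions
    conditioned on the head trajectory, mimicking the conditional diffusion step.

    Returns `num_samples` motions, each of assumed shape (T, J, 3) for J joints.
    """
    T = head_poses.shape[0]
    J = 22  # assumed body joint count (e.g., an SMPL-like skeleton)
    rng = np.random.default_rng(0)
    return [rng.normal(size=(T, J, 3)) for _ in range(num_samples)]


if __name__ == "__main__":
    frames = np.zeros((120, 224, 224, 3))               # 120 dummy egocentric frames
    head = estimate_head_motion(frames)                  # stage 1: ego-head pose
    motions = sample_body_motion(head, num_samples=3)    # stage 2: ego-body pose
    print(head.shape, [m.shape for m in motions])
```

Because the two stages share only the head trajectory as an interface, each could in principle be trained on a different dataset, which is the decoupling the abstract highlights.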