Paper Title
Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage Optimization
Paper Authors
Paper Abstract
Estimating 3D human poses from a monocular video is still a challenging task. The performance of many existing methods drops when the target person is occluded by other objects, or when the motion is too fast or too slow relative to the scale and speed of the training data. Moreover, many of these methods are not explicitly designed or trained for severe occlusion, which compromises their ability to handle it. To address these problems, we introduce a spatio-temporal network for robust 3D human pose estimation. As humans in videos may appear at different scales and move at various speeds, we apply multi-scale spatial features for 2D joint or keypoint prediction in each individual frame, and multi-stride temporal convolutional networks (TCNs) to estimate the 3D joints or keypoints. Furthermore, we design a spatio-temporal discriminator based on body structure as well as limb motion to assess whether the predicted pose forms a valid pose and a valid movement. During training, we explicitly mask out some keypoints to simulate occlusion cases ranging from minor to severe, so that our network can learn better and become robust to various degrees of occlusion. As 3D ground-truth data are limited, we further utilize 2D video data to inject a semi-supervised learning capability into our network. Moreover, we observe a discrepancy between 3D pose prediction and 2D pose estimation due to the different pose variations in video and image training datasets. We therefore propose a confidence-based inference stage optimization that adaptively enforces the 3D pose projection to match the 2D pose estimation, further improving the final pose prediction accuracy. Experiments on public datasets validate the effectiveness of our method, and our ablation studies show the strengths of our network's individual submodules.
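To make the occlusion-simulation idea in the abstract concrete, below is a minimal PyTorch sketch of randomly masking 2D keypoints during training. The function name, tensor shapes, and the `max_masked` parameter are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def mask_keypoints(keypoints_2d, confidences, max_masked=5):
    """Randomly zero out a subset of 2D keypoints (and their confidences)
    to simulate occlusion, from minor to severe, during training.

    keypoints_2d: (T, J, 2) tensor of 2D joint positions per frame.
    confidences:  (T, J) tensor of per-joint detection confidences.
    max_masked:   assumed upper bound on joints masked per frame.
    """
    T, J, _ = keypoints_2d.shape
    masked_kpts = keypoints_2d.clone()
    masked_conf = confidences.clone()
    for t in range(T):
        # Pick how severe the simulated occlusion is for this frame.
        n_masked = torch.randint(0, max_masked + 1, (1,)).item()
        if n_masked == 0:
            continue
        joints = torch.randperm(J)[:n_masked]
        masked_kpts[t, joints] = 0.0   # drop positions of "occluded" joints
        masked_conf[t, joints] = 0.0   # mark them as unreliable
    return masked_kpts, masked_conf
```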
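The confidence-based inference stage optimization can likewise be sketched as refining the predicted 3D pose so that its projection matches the 2D estimates, with each joint weighted by its 2D confidence. The following is only a plausible reading of that idea under assumed shapes and a user-supplied `project_fn` camera model; step count, learning rate, and the regularizer are hypothetical choices.

```python
import torch

def refine_pose_3d(pose_3d_init, pose_2d, conf, project_fn,
                   steps=50, lr=1e-2, reg_weight=1e-3):
    """Confidence-weighted inference-stage refinement (a sketch).

    pose_3d_init: (J, 3) network-predicted 3D joints (starting point).
    pose_2d:      (J, 2) off-the-shelf 2D keypoint estimates.
    conf:         (J,)   per-joint 2D confidences in [0, 1].
    project_fn:   differentiable callable mapping (J, 3) -> (J, 2);
                  assumed known camera projection.
    """
    pose_3d = pose_3d_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([pose_3d], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        reproj = project_fn(pose_3d)
        # Joints with low 2D confidence contribute little to the loss.
        reproj_loss = (conf * ((reproj - pose_2d) ** 2).sum(dim=-1)).mean()
        # Keep the refined pose close to the network prediction.
        reg_loss = ((pose_3d - pose_3d_init) ** 2).mean()
        loss = reproj_loss + reg_weight * reg_loss
        loss.backward()
        optimizer.step()
    return pose_3d.detach()
```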