Paper Title
Learning Heatmap-Style Jigsaw Puzzles Provides Good Pretraining for 2D Human Pose Estimation
Paper Authors
Paper Abstract
The target of 2D human pose estimation is to locate the keypoints of body parts in input 2D images. State-of-the-art methods for pose estimation usually construct pixel-wise heatmaps from keypoints as labels for learning convolutional neural networks, which are typically initialized randomly or with ImageNet classification models as their backbones. We note that the 2D pose estimation task depends heavily on the contextual relationship between image patches, so we introduce a self-supervised method for pretraining 2D pose estimation networks. Specifically, we propose the Heatmap-Style Jigsaw Puzzles (HSJP) problem as our pretext task, whose target is to learn the location of each patch in an image composed of shuffled patches. During pretraining, we only use images of person instances in MS-COCO, rather than introducing the extra and much larger ImageNet dataset. A heatmap-style label for patch location is designed, and our learning process is non-contrastive. The weights learned on the HSJP pretext task are used as the backbone of a 2D human pose estimator, which is then finetuned on the MS-COCO human keypoints dataset. With two popular and strong 2D human pose estimators, HRNet and SimpleBaseline, we evaluate the mAP score on both the MS-COCO validation and test-dev datasets. Our experiments show that downstream pose estimators with our self-supervised pretraining obtain much better performance than those trained from scratch, and are comparable to those using ImageNet classification models as their initial backbones.
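To make the pretext task concrete, below is a minimal NumPy sketch of how shuffled-patch inputs and heatmap-style location labels could be constructed. The 3x3 patch grid, 64x64 heatmap resolution, Gaussian sigma, and the channel-to-patch assignment are illustrative assumptions, not details taken from the paper.

```python
# Sketch of HSJP-style training-pair construction (assumed hyperparameters:
# 3x3 grid, 64x64 heatmaps, sigma = 2.0; not values from the paper).
import numpy as np


def shuffle_patches(image, grid=3, rng=None):
    """Split `image` (H, W, C) into a grid x grid tiling and shuffle the tiles.

    Returns the shuffled image and the permutation `perm`, where the i-th
    patch of the shuffled image (row-major order) originally sat at grid
    cell perm[i].
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid) for c in range(grid)]
    perm = rng.permutation(grid * grid)
    rows = [np.concatenate([patches[perm[r * grid + c]] for c in range(grid)], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0), perm


def heatmap_labels(perm, grid=3, heatmap_size=64, sigma=2.0):
    """Build one heatmap per shuffled patch, peaked at its original location.

    This mirrors heatmap-style keypoint labels: channel k is a Gaussian
    centred on the grid-cell centre that the k-th shuffled patch came from.
    """
    ys, xs = np.mgrid[0:heatmap_size, 0:heatmap_size]
    cell = heatmap_size / grid
    heatmaps = np.zeros((len(perm), heatmap_size, heatmap_size), dtype=np.float32)
    for k, orig in enumerate(perm):
        cy = (orig // grid + 0.5) * cell
        cx = (orig % grid + 0.5) * cell
        heatmaps[k] = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return heatmaps


if __name__ == "__main__":
    img = np.random.rand(192, 192, 3).astype(np.float32)  # stand-in for a cropped person image
    shuffled, perm = shuffle_patches(img)
    targets = heatmap_labels(perm)
    print(shuffled.shape, targets.shape)  # (192, 192, 3) (9, 64, 64)
```

Because the targets have the same dense heatmap form as keypoint labels, a pose network such as HRNet or SimpleBaseline can regress them with its usual per-pixel loss, which is what lets the pretext-task weights transfer directly as the estimator's backbone.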