Paper Title
Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation
Paper Authors
Paper Abstract
Supervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets to achieve state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human-annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask whether we can leverage semi-supervised learning on unlabeled video sequences and extra images to improve performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. This procedure is iterated several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results on all three Cityscapes benchmarks, reaching 67.8% PQ, 42.6% AP, and 85.2% mIoU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences and extra images to surpass state-of-the-art performance on core computer vision tasks.
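For concreteness, the iterative pseudo-labeling loop described in the abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `train_segmentation_model` and `predict_labels` are hypothetical placeholders for training a segmentation network and running inference with it.

```python
# Minimal sketch of iterative semi-supervised learning via pseudo-labels,
# as described in the abstract. The helpers train_segmentation_model and
# predict_labels are hypothetical placeholders, not the paper's actual code.

def naive_student(labeled_data, unlabeled_images, num_iterations=3):
    """Iterate teacher -> pseudo-labels -> student training.

    labeled_data:     list of (image, human_annotation) pairs
    unlabeled_images: list of images, e.g. frames from unlabeled video
    """
    # Start from a teacher trained only on human-annotated data.
    teacher = train_segmentation_model(labeled_data)

    for _ in range(num_iterations):
        # 1. Predict pseudo-labels for every unlabeled image.
        pseudo_labeled = [(img, predict_labels(teacher, img))
                          for img in unlabeled_images]

        # 2. Train a new student on human-annotated plus pseudo-labeled data.
        student = train_segmentation_model(labeled_data + pseudo_labeled)

        # 3. The student becomes the teacher for the next iteration.
        teacher = student

    return teacher
```

The key design point, per the abstract, is that no label-propagation machinery (patch matching, optical flow) is needed; each round simply re-labels the unlabeled frames with the current model and retrains on the union of the two datasets.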