Paper Title
Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning
Paper Authors
Paper Abstract
The availability of large labeled datasets is a key component of the success of deep learning. However, annotating labels on large datasets is generally time-consuming and expensive. Active learning is a research area that addresses the issue of expensive labeling by selecting the most important samples for labeling. Diversity-based sampling algorithms are known as integral components of representation-based approaches for active learning. In this paper, we introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting. Self-supervised representation learning is used to account for the diversity of samples in the initial dataset selection algorithm. In addition, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings. By combining consistency information with diversity in the consistency-based embedding scheme, the proposed method can select more informative samples for labeling in the semi-supervised learning setting. Comparative experiments show that, by exploiting the diversity of unlabeled data, the proposed method achieves compelling results on the CIFAR-10 and Caltech-101 datasets compared with previous active learning approaches.
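Both steps described in the abstract, selecting an initial labeled set from self-supervised features and selecting query samples from consistency-based embeddings, reduce to diversity-based sampling over an embedding matrix. The sketch below is a minimal, hedged illustration of one common realization of such sampling, farthest-point (k-center greedy) selection; the function names, the use of NumPy, and the choice of k-center greedy are assumptions for illustration, not the paper's actual implementation.

```python
# A minimal sketch of diversity-based sampling on an embedding matrix,
# using farthest-point (k-center greedy) selection. The embedding source
# (self-supervised features or consistency-based embeddings) and all names
# here are illustrative assumptions, not the paper's actual method.
import numpy as np


def diversity_select(embeddings: np.ndarray, budget: int, seed: int = 0) -> list[int]:
    """Greedily pick `budget` sample indices whose embeddings are mutually far apart."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # start from a random sample
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(budget - 1):
        next_idx = int(np.argmax(min_dist))  # farthest point from the selected set
        selected.append(next_idx)
        new_dist = np.linalg.norm(embeddings - embeddings[next_idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected


if __name__ == "__main__":
    # Toy usage: 1000 unlabeled samples with 128-d embeddings and a label budget of 10.
    feats = np.random.default_rng(1).normal(size=(1000, 128))
    print(diversity_select(feats, budget=10))
```

In this reading, the same routine would be applied to self-supervised features for the initial dataset selection and to consistency-based embeddings at each query round; the actual selection rule used by the authors may differ.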