Paper Title
Statistical and Algorithmic Insights for Semi-supervised Learning with Self-training
Paper Authors
Paper Abstract
Self-training is a classical approach in semi-supervised learning that has been successfully applied to a variety of machine learning problems. A self-training algorithm generates pseudo-labels for the unlabeled examples and progressively refines these pseudo-labels, which hopefully come to coincide with the actual labels. This work provides theoretical insights into self-training algorithms with a focus on linear classifiers. We first investigate Gaussian mixture models and provide a sharp non-asymptotic finite-sample characterization of the self-training iterations. Our analysis reveals the provable benefits of rejecting samples with low confidence and demonstrates that self-training iterations gracefully improve the model accuracy even if they do get stuck in sub-optimal fixed points. We then demonstrate that regularization and class margin (i.e., separation) are provably important for success, and that lack of regularization may prevent self-training from identifying the core features in the data. Finally, we discuss statistical aspects of empirical risk minimization with self-training for general distributions. We show how a purely unsupervised notion of generalization based on self-training-based clustering can be formalized in terms of the cluster margin. We then establish a connection between self-training-based semi-supervision and the more general problem of learning with heterogeneous data and weak supervision.
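To make the pseudo-labeling loop described in the abstract concrete, the sketch below runs self-training with a linear classifier (logistic regression) on a two-component Gaussian mixture and rejects pseudo-labels whose predicted confidence falls below a threshold. This is a minimal illustration, not the paper's exact procedure: the mixture means, the threshold tau, the regularization strength C, and the number of iterations are illustrative assumptions.

```python
# Minimal self-training sketch with confidence-based rejection on a
# two-component Gaussian mixture (illustrative assumptions throughout).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Gaussian mixture: two classes with opposite means (class margin grows with ||mu||).
mu, d, n_lab, n_unlab = np.array([2.0, 0.0]), 2, 20, 2000
y_lab = rng.integers(0, 2, n_lab)
X_lab = rng.normal(size=(n_lab, d)) + np.where(y_lab[:, None] == 1, mu, -mu)
y_unlab_true = rng.integers(0, 2, n_unlab)   # hidden labels, used only to report accuracy
X_unlab = rng.normal(size=(n_unlab, d)) + np.where(y_unlab_true[:, None] == 1, mu, -mu)

# Initial linear classifier trained on the small labeled set (L2-regularized via C).
clf = LogisticRegression(C=1.0).fit(X_lab, y_lab)

tau = 0.8  # confidence threshold: pseudo-labels below this probability are rejected
for it in range(5):
    proba = clf.predict_proba(X_unlab)
    conf = proba.max(axis=1)
    pseudo = proba.argmax(axis=1)
    keep = conf >= tau                        # reject low-confidence samples
    # Refit on the labeled data plus the confidently pseudo-labeled data.
    X_train = np.vstack([X_lab, X_unlab[keep]])
    y_train = np.concatenate([y_lab, pseudo[keep]])
    clf = LogisticRegression(C=1.0).fit(X_train, y_train)
    acc = (clf.predict(X_unlab) == y_unlab_true).mean()
    print(f"iter {it}: kept {keep.sum()} pseudo-labels, unlabeled accuracy {acc:.3f}")
```

Raising tau keeps only pseudo-labels far from the decision boundary, which mirrors the abstract's point about the benefit of rejecting low-confidence samples; setting tau to 0 recovers plain pseudo-labeling on all unlabeled data.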