Semi-Supervised Empirical Risk Minimization: Using unlabeled data to improve prediction

Paper title

Semi-Supervised Empirical Risk Minimization: Using unlabeled data to improve prediction

Authors

Oren Yuval, Saharon Rosset

Abstract

We present a general methodology for using unlabeled data to design semi-supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we analyze the effectiveness of our SSL approach in improving prediction performance. The key ideas are to carefully consider the null model as a competitor, and to use the unlabeled data to determine signal-noise combinations where SSL outperforms both supervised learning and the null model. We then use SSL in an adaptive manner, based on estimates of the signal and noise. In the special case of linear regression with Gaussian covariates, we prove that the non-adaptive SSL version is in fact not capable of improving on both the supervised estimator and the null model simultaneously, beyond a negligible O(1/n) term. On the other hand, the adaptive model presented in this work can achieve a substantial improvement over both competitors simultaneously, under a variety of settings. This is shown empirically through extensive simulations, and extended to other scenarios, such as non-Gaussian covariates, misspecified linear regression, or generalized linear regression with non-linear link functions.
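The comparison among the three competitors named in the abstract (null model, supervised estimator, semi-supervised estimator) can be illustrated with a minimal sketch for linear regression. This is not the authors' implementation. It assumes centered Gaussian covariates, no intercept, and one simple way of using unlabeled data: the quadratic term of the squared loss depends only on the covariates, so its Gram matrix is estimated from the unlabeled sample. All function names and simulation settings are hypothetical, and the adaptive rule based on signal and noise estimates is not reproduced here.

```python
import numpy as np


def fit_null(y):
    """Null model: ignore the covariates and predict the mean response."""
    return float(np.mean(y))


def fit_ols(X, y):
    """Supervised least squares (no intercept; covariates assumed centered)."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def fit_ssl_ols(X, y, Z):
    """Illustrative semi-supervised variant (an assumption, not the paper's code).

    The term beta^T E[x x^T] beta of the squared loss involves covariates only,
    so E[x x^T] is estimated from the unlabeled sample Z, while the cross term
    E[x y] is still estimated from the labeled pairs (X, y).
    """
    gram_unlabeled = Z.T @ Z / Z.shape[0]
    cross_labeled = X.T @ y / X.shape[0]
    return np.linalg.solve(gram_unlabeled, cross_labeled)


def compare(n=30, m=5000, p=10, signal=1.0, noise=1.0, seed=0):
    """Simulate one data set and return out-of-sample MSE for each competitor."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=p)
    beta *= signal / np.linalg.norm(beta)       # fix the signal strength
    X = rng.normal(size=(n, p))                 # labeled covariates
    Z = rng.normal(size=(m, p))                 # unlabeled covariates
    y = X @ beta + noise * rng.normal(size=n)
    X_new = rng.normal(size=(20000, p))         # fresh test sample
    y_new = X_new @ beta + noise * rng.normal(size=20000)

    def mse(pred):
        return float(np.mean((y_new - pred) ** 2))

    return {
        "null": mse(np.full(y_new.shape, fit_null(y))),
        "supervised": mse(X_new @ fit_ols(X, y)),
        "semi_supervised": mse(X_new @ fit_ssl_ols(X, y, Z)),
    }
```

Calling `compare()` with different `signal` and `noise` values shows how the ranking of the three predictors can change with the signal-noise combination, which is the kind of regime dependence the abstract's adaptive approach is meant to address.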
