Paper Title

Dimensionality reduction, regularization, and generalization in overparameterized regressions

Authors

Huang, Ningyuan, Hogg, David W., Villar, Soledad

Abstract

Overparameterization in deep learning is powerful: Very large models fit the training data perfectly and yet often generalize well. This realization brought back the study of linear models for regression, including ordinary least squares (OLS), which, like deep learning, shows a "double-descent" behavior: (1) The risk (expected out-of-sample prediction error) can grow arbitrarily when the number of parameters $p$ approaches the number of samples $n$, and (2) the risk decreases with $p$ for $p>n$, sometimes achieving a lower value than the lowest risk for $p<n$. The divergence of the risk for OLS can be avoided with regularization. In this work, we show that for some data models it can also be avoided with a PCA-based dimensionality reduction (PCA-OLS, also known as principal component regression). We provide non-asymptotic bounds for the risk of PCA-OLS by considering the alignments of the population and empirical principal components. We show that dimensionality reduction improves robustness while OLS is arbitrarily susceptible to adversarial attacks, particularly in the overparameterized regime. We compare PCA-OLS theoretically and empirically with a wide range of projection-based methods, including random projections, partial least squares (PLS), and certain classes of linear two-layer neural networks. These comparisons are made for different data generation models to assess the sensitivity to signal-to-noise and the alignment of regression coefficients with the features. We find that methods in which the projection depends on the training data can outperform methods where the projections are chosen independently of the training data, even those with oracle knowledge of population quantities, another seemingly paradoxical phenomenon that has been identified previously. This suggests that overparameterization may not be necessary for good generalization.
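
Below is a minimal simulation sketch (not the authors' code) of the comparison the abstract describes: minimum-norm OLS versus PCA-OLS (principal component regression) on a simple latent-factor data model, to illustrate the double-descent peak near $p \approx n$ and how a PCA-based projection can avoid it. The data model, latent dimension, noise levels, and number of retained components are illustrative assumptions, not quantities from the paper.

```python
# Sketch: minimum-norm OLS vs. PCA-OLS on a latent-factor regression model.
# All model parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d_latent, k, sigma = 100, 2000, 5, 5, 0.5

def make_data(p):
    # Features are noisy linear images of a low-dimensional latent signal;
    # the response depends only on the latent signal.
    W = rng.normal(size=(d_latent, p)) / np.sqrt(d_latent)
    beta = rng.normal(size=d_latent)
    def sample(m):
        Z = rng.normal(size=(m, d_latent))
        X = Z @ W + 0.1 * rng.normal(size=(m, p))
        y = Z @ beta + sigma * rng.normal(size=m)
        return X, y
    return sample(n), sample(n_test)

def min_norm_ols(X, y):
    # Minimum-norm least squares via the pseudoinverse (well defined for p > n).
    return np.linalg.pinv(X) @ y

def pca_ols(X, y, k):
    # Project onto the top-k empirical principal directions, then run OLS.
    # Features are zero-mean by construction, so centering is skipped here.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k].T                                  # p x k projection
    theta = np.linalg.lstsq(X @ V, y, rcond=None)[0]
    return V @ theta

for p in [20, 50, 90, 100, 110, 200, 500]:
    (Xtr, ytr), (Xte, yte) = make_data(p)
    for name, beta_hat in [("OLS    ", min_norm_ols(Xtr, ytr)),
                           ("PCA-OLS", pca_ols(Xtr, ytr, k))]:
        risk = np.mean((Xte @ beta_hat - yte) ** 2)
        print(f"p={p:4d}  {name}  test MSE={risk:8.3f}")
```

In this toy setup the minimum-norm OLS test error typically spikes as $p$ approaches $n$ and falls again for $p \gg n$, while the PCA-based projection keeps the error controlled throughout; the exact numbers depend on the assumed noise levels and latent dimension.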
