Paper Title
Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression
Paper Authors
Paper Abstract
We examine the necessity of interpolation in overparameterized models, that is, when achieving optimal predictive risk in machine learning problems requires (nearly) interpolating the training data. In particular, we consider simple overparameterized linear regression $y = X\theta + w$ with random design $X \in \mathbb{R}^{n \times d}$ under the proportional asymptotics $d/n \to \gamma \in (1, \infty)$. We precisely characterize how prediction (test) error necessarily scales with training error in this setting. An implication of this characterization is that, as the label noise variance $\sigma^2 \to 0$, any estimator incurring at least $\mathsf{c}\sigma^4$ training error for some constant $\mathsf{c}$ is necessarily suboptimal and suffers excess prediction error growing at least linearly in the training error. Thus, optimal performance requires fitting the training data to substantially higher accuracy than the problem's inherent noise floor.
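To make the setup concrete, here is a minimal numerical sketch of the regression model described in the abstract; it is not the paper's analysis. It assumes an isotropic Gaussian design with $\gamma = d/n = 2$, a unit-norm planted signal, and compares the minimum-norm interpolator (zero training error) against ridge estimators whose regularization strength $\lambda$ trades training error against prediction risk. All parameter values and the choice of estimators are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Proportional regime: gamma = d/n = 2 > 1 (overparameterized).
n, d = 200, 400
sigma = 0.1                       # label noise std; sigma**2 is the noise variance

# Isotropic Gaussian design x_i ~ N(0, I_d) and a unit-norm planted signal theta*.
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
y = X @ theta_star + sigma * rng.standard_normal(n)

def train_and_test_error(theta_hat):
    """Empirical training MSE, and the exact prediction risk for isotropic design:
    E[(x @ theta_hat - x @ theta_star)^2] + sigma^2 = ||theta_hat - theta*||^2 + sigma^2."""
    train = np.mean((y - X @ theta_hat) ** 2)
    test = np.linalg.norm(theta_hat - theta_star) ** 2 + sigma ** 2
    return train, test

# Minimum-norm interpolator: fits the training data exactly (train error ~ 0).
theta_mn = np.linalg.pinv(X) @ y
print("min-norm    train=%.2e  test=%.4f" % train_and_test_error(theta_mn))

# Ridge estimators: increasing lambda increases training error.
for lam in [1e-2, 1.0, 100.0]:
    theta_r = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print("lam=%7.2f  train=%.2e  test=%.4f"
          % ((lam,) + train_and_test_error(theta_r)))
```

Shrinking `sigma` and sweeping `lam` in this sketch is one way to probe the regime the abstract describes, where test error is tracked against how far above the noise floor the training error sits.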