Paper title
Regularization-wise double descent: Why it occurs and how to eliminate it
Paper authors
Paper abstract
The risk of overparameterized models, in particular deep neural networks, is often double-descent shaped as a function of the model size. Recently, it was shown that the risk as a function of the early-stopping time can also be double-descent shaped, and this behavior can be explained as a superposition of bias-variance tradeoffs. In this paper, we show that the risk of explicit L2-regularized models can exhibit double descent behavior as a function of the regularization strength, both in theory and practice. We find that for linear regression, a double-descent-shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model and can be mitigated by scaling the regularization strength of each part appropriately. Motivated by this result, we study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layer. Lastly, we study a 5-layer CNN and ResNet-18 trained on CIFAR-10 with label noise, and CIFAR-100 without label noise, and demonstrate that all exhibit double descent behavior as a function of the regularization strength.
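The linear-regression setting the abstract refers to can be sketched numerically: fit a ridge (explicit L2-regularized) estimator in closed form and sweep the regularization strength while measuring the test risk. This is a hypothetical toy setup, not the paper's experimental configuration; the dimensions, noise level, and grid of strengths are illustrative assumptions, and whether the resulting risk curve is double-descent shaped depends on the data distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparameterized regression: more features (d) than samples (n),
# with additive observation noise. All sizes here are illustrative.
n, d = 50, 200
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d) / np.sqrt(d)
y = X @ w_true + 0.5 * rng.normal(size=n)

# Large noiseless test set to approximate the population risk.
X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_true

def ridge_risk(lam):
    """Closed-form ridge estimator w = (X'X + lam*I)^{-1} X'y and its test MSE."""
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return np.mean((X_test @ w_hat - y_test) ** 2)

# Sweep the regularization strength over several orders of magnitude
# and record the risk curve; plotting risks vs. lams would reveal its shape.
lams = np.logspace(-4, 2, 25)
risks = np.array([ridge_risk(l) for l in lams])
```

The closed-form solve keeps the sweep cheap, so the risk curve can be inspected over a wide log-spaced grid of strengths.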