Paper Title

Implicit Gradient Regularization

Authors

David G. T. Barrett, Benoit Dherin

Abstract

Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test errors are small and solutions are robust to noisy parameter perturbations. Furthermore, we demonstrate that the implicit gradient regularization term can be used as an explicit regularizer, allowing us to control this gradient regularization directly. More broadly, our work indicates that backward error analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter regularization interact to determine the properties of overparameterized models optimized with gradient descent.
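
The abstract notes that the implicit gradient regularization term can also be applied as an explicit regularizer. Below is a minimal sketch of that idea, assuming a squared-gradient-norm penalty added to the training loss; the JAX setting, the toy least-squares problem, and the coefficient `mu` are illustrative choices, not the authors' implementation.

```python
# Sketch (not the authors' code): train on a modified objective
#   L_tilde(w) = L(w) + mu * ||grad L(w)||^2,
# which explicitly penalizes parameter settings with large loss gradients,
# mirroring the implicit regularization described in the abstract.
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Simple least-squares loss on a linear model; stands in for a network loss.
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

def regularized_loss(w, x, y, mu=1e-2):
    # Add the squared norm of the loss gradient as an explicit regularizer.
    g = jax.grad(loss)(w, x, y)
    return loss(w, x, y) + mu * jnp.sum(g ** 2)

@jax.jit
def gd_step(w, x, y, lr=1e-1):
    # One full-batch gradient-descent step on the regularized objective.
    g = jax.grad(regularized_loss)(w, x, y)
    return w - lr * g

# Tiny usage example with synthetic data.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 4))
true_w = jnp.array([1.0, -2.0, 0.5, 3.0])
y = x @ true_w
w = jnp.zeros(4)
for _ in range(200):
    w = gd_step(w, x, y)
print(w)  # should approach true_w
```

Because the penalty involves the gradient of the loss, differentiating the regularized objective requires second-order information, which the sketch obtains by nesting `jax.grad` calls.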
