Paper Title
Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions
Paper Authors
Paper Abstract
Generalization is one of the most important problems in deep learning (DL). In the overparameterized regime of neural networks, there exist many low-loss solutions that fit the training data equally well. The key question is which solution is more generalizable. Empirical studies show a strong correlation between the flatness of the loss landscape at a solution and its generalizability, and stochastic gradient descent (SGD) is crucial for finding flat solutions. To understand how SGD drives the learning system to flat solutions, we construct a simple model whose loss landscape has a continuous set of degenerate (or near-degenerate) minima. By solving the Fokker-Planck equation of the underlying stochastic learning dynamics, we show that, due to its strong anisotropy, the SGD noise introduces an additional effective loss term that decreases with flatness and whose overall strength increases with the learning rate and the batch-to-batch variation. We find that this additional landscape-dependent SGD loss breaks the degeneracy and serves as an effective regularization for finding flat solutions. Furthermore, stronger SGD noise shortens the convergence time to the flat solutions. However, we identify an upper bound on the SGD noise beyond which the system fails to converge. Our results not only elucidate the role of SGD in generalization, but may also have important implications for hyperparameter selection for efficient learning without divergence.
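
As a rough illustration of the mechanism summarized in the abstract (not the paper's actual model or derivation), the toy Python sketch below runs minibatch SGD on a two-dimensional loss with a continuous line of degenerate minima along y = 0, whose sharpness k(x) = 1 + x^2 varies along the valley. The per-sample loss, the noise construction, and all hyperparameters (learning rate, batch sizes, step count) are illustrative assumptions; the point is only that landscape-dependent minibatch noise produces a drift toward the flattest part of the valley, with a strength that grows with the learning rate and shrinks with the batch size, while full-batch gradient descent never moves along the degenerate valley at all.

# Toy sketch (assumed model, not the paper's): SGD on a valley of
# degenerate minima with position-dependent sharpness k(x).
import numpy as np

rng = np.random.default_rng(0)

def k(x):
    """Curvature across the valley: flattest at x = 0, sharper as |x| grows."""
    return 1.0 + x**2

def dk(x):
    return 2.0 * x

def minibatch_grad(x, y, batch_size, sigma=1.0):
    """Gradient of the minibatch average of the per-sample loss
        l(x, y; eps) = 0.5 * k(x) * y**2 + sqrt(k(x)) * eps * y,
    with eps ~ N(0, sigma^2).  Its expectation is 0.5 * k(x) * y**2,
    which vanishes on the whole line y = 0 (degenerate minima), while
    the gradient noise grows with the local sharpness k(x).
    batch_size=None means full-batch (noise-free) gradient descent."""
    eps = 0.0 if batch_size is None else rng.normal(0.0, sigma, batch_size).mean()
    gx = 0.5 * dk(x) * (y**2 + eps * y / np.sqrt(k(x)))
    gy = k(x) * y + np.sqrt(k(x)) * eps
    return gx, gy

def run(lr=0.05, batch_size=None, steps=50_000, x0=2.0, y0=0.0):
    x, y = x0, y0
    for _ in range(steps):
        gx, gy = minibatch_grad(x, y, batch_size)
        x, y = x - lr * gx, y - lr * gy
    return x

if __name__ == "__main__":
    for B in (None, 64, 8, 1):
        label = "full batch" if B is None else f"batch={B}"
        print(f"{label:>10}: final x = {run(batch_size=B):+.3f}  (flattest minimum at x = 0)")

In this sketch the full-batch run stays at its initial x, since the exact gradient along the degenerate valley is zero, whereas the minibatch runs drift toward x = 0, more quickly for smaller batches (stronger noise), qualitatively matching the effective flatness-favoring regularization described above.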