Paper Title

A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks

Paper Authors

Yuxin Sun, Dong Lao, Ganesh Sundaramoorthi, Anthony Yezzi

Paper Abstract

We discover restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD) and its variants. We show that numerical error (on the order of the smallest floating point bit, and thus the most extreme or limiting numerical perturbation) induced from floating point arithmetic in training deep nets can be amplified significantly and result in significant test accuracy variance (sensitivities), comparable to the test accuracy variance due to stochasticity in SGD. We show how this is likely traced to instabilities of the optimization dynamics that are restrained, i.e., localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs). We show that it is stable only under certain conditions on the learning rate and weight decay. We show that rather than blowing up when the conditions are violated, the instability can be restrained. We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect. We link restrained instabilities to the recently discovered Edge of Stability (EoS) phenomenon, in which the stable step size predicted by classical theory is exceeded while the loss continues to decrease and training still converges. Because restrained instabilities occur at the EoS, our theory provides new insights and predictions about the EoS, in particular the role of regularization and the dependence on network complexity.
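For context on the "stable step size predicted by classical theory": on a quadratic model the bound follows from a one-line linear stability calculation. The notation below (curvature \lambda, weight decay \mu, learning rate \eta) is ours, standard numerical-analysis background rather than a result specific to this paper:

\[
w_{k+1} = w_k - \eta(\lambda + \mu)\, w_k = \bigl(1 - \eta(\lambda + \mu)\bigr)\, w_k,
\qquad
\text{stable} \iff \bigl|1 - \eta(\lambda + \mu)\bigr| < 1 \iff \eta < \frac{2}{\lambda + \mu}.
\]

At the Edge of Stability, the measured sharpness (largest Hessian eigenvalue) hovers near 2/\eta, so training runs at the boundary of this condition rather than safely inside it.

The claim that a perturbation "on the order of the smallest floating point bit" can be amplified into order-one differences can be illustrated on a toy problem. The sketch below is a minimal example of ours, not the paper's CNN experiment: it perturbs the initialization by one unit in the last place (ulp) via numpy's nextafter and runs gradient descent with a step size that violates the bound above near the minima.

import numpy as np

# Toy nonconvex loss L(w) = w^4/4 - w^2/2 with minima at w = +/-1,
# where L''(w) = 2, so classical stability requires eta < 2/2 = 1.
def grad(w):
    return w**3 - w  # dL/dw

def descend(w, eta=1.4, steps=200):
    # eta = 1.4 deliberately exceeds the classical stability bound (eta < 1).
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

w0 = np.float64(0.3)
w0_ulp = np.nextafter(w0, np.inf)  # smallest possible perturbation of w0

print("initial gap:", w0_ulp - w0)                          # ~5.6e-17
print("final gap:  ", abs(descend(w0) - descend(w0_ulp)))   # typically O(1)

With eta below 1, both runs converge to the same minimum and the gap shrinks; above the bound, the iterates do not blow up but oscillate chaotically near the minima, so the one-ulp gap grows until it is comparable to the size of the orbit itself, a small-scale analogue of the restrained instability and test-accuracy variance described above.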
