Paper Title

The Heavy-Tail Phenomenon in SGD

Paper Authors

Mert Gurbuzbalaban, Umut Şimşekli, Lingjiong Zhu

Paper Abstract

In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the `flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize $η$ to the batch-size $b$, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the `tail-index', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters $η$ and $b$, the SGD iterates will converge to a \emph{heavy-tailed} stationary distribution. We rigorously prove this claim in the setting of quadratic optimization: we show that even in a simple linear regression problem with independent and identically distributed data whose distribution has finite moments of all order, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We support our theory with experiments conducted on synthetic data, fully connected, and convolutional neural networks.
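To make the quadratic-optimization claim above concrete, here is a minimal sketch (not the authors' code): it runs mini-batch SGD on a one-dimensional linear regression problem with i.i.d. Gaussian data and estimates the tail-index of the stationary iterates with a simple Hill estimator. The stepsize values, batch size, iteration counts, and the `run_sgd`/`hill_estimator` helpers are illustrative choices, not values taken from the paper.

```python
# Minimal sketch of the heavy-tail phenomenon for SGD on a 1-D quadratic loss.
# All parameter values below are illustrative assumptions, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def run_sgd(eta, b, n_iters=200_000, burn_in=50_000):
    """Run mini-batch SGD on the loss E[(a*x - y)^2]/2 with a, y ~ N(0, 1).

    Each update reads x <- (1 - eta * mean(a_i^2)) * x + eta * mean(a_i * y_i),
    i.e. a random linear recursion whose multiplier can exceed 1 in magnitude;
    that is what can drive heavy tails even though the data are Gaussian.
    """
    x = 0.0
    samples = np.empty(n_iters - burn_in)
    for k in range(n_iters):
        a = rng.standard_normal(b)          # features (finite moments of all orders)
        y = rng.standard_normal(b)          # targets
        grad = np.mean(a * (a * x - y))     # mini-batch stochastic gradient
        x -= eta * grad
        if k >= burn_in:
            samples[k - burn_in] = x
    return np.abs(samples)

def hill_estimator(abs_samples, k=1000):
    """Crude Hill estimator of the tail-index from the k largest |x| values."""
    order = np.sort(abs_samples)[::-1][: k + 1]
    return 1.0 / np.mean(np.log(order[:k] / order[k]))

for eta in (0.3, 0.6, 0.9):                 # larger eta (at fixed b) -> heavier tails
    alpha_hat = hill_estimator(run_sgd(eta=eta, b=1))
    print(f"eta = {eta:.1f}  ->  estimated tail-index ~ {alpha_hat:.2f}")
```

With Gaussian inputs and batch size 1, the update multiplier is 1 - eta * a^2, so the iterates follow a Kesten-type random linear recursion; in this toy setting the stationary variance already becomes infinite roughly once eta exceeds 2/3 (where E[(1 - eta * a^2)^2] > 1), and the estimated tail-index should shrink as eta grows, in line with the stepsize-to-batch-size effect described in the abstract.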
