Paper Title
On the Generalization Benefit of Noise in Stochastic Gradient Descent
Paper Authors
Paper Abstract
It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks. However, recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and we provide a theoretical account of our observations based on the stochastic differential equation perspective of SGD dynamics.
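For context, here is a minimal sketch of the stochastic differential equation view mentioned above, as it is commonly stated in the SGD-as-SDE literature; the notation ($\epsilon$ for the learning rate, $B$ for the batch size, $\mathcal{B}_t$ for the minibatch, $C(\theta)$ for the gradient covariance, $W_t$ for a Wiener process) is assumed for illustration and is not taken from the paper itself. The minibatch update
$$\theta_{t+1} = \theta_t - \frac{\epsilon}{B} \sum_{i \in \mathcal{B}_t} \nabla L_i(\theta_t)$$
is often modeled as a discretization of the SDE
$$d\theta = -\nabla L(\theta)\, dt + \sqrt{\frac{\epsilon}{B}}\; C(\theta)^{1/2}\, dW_t,$$
in which the magnitude of the injected gradient noise is governed by the ratio $\epsilon / B$. Under this view, shrinking the batch size at a fixed learning rate (or raising the learning rate at a fixed batch size) increases the noise in the dynamics, which is the mechanism the abstract credits for the improved generalization of small and moderately large batches.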