Paper Title

Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape

Paper Authors

Devansh Bisla, Jing Wang, Anna Choromanska

Paper Abstract

In this paper, we study the sharpness of the deep learning (DL) loss landscape around local minima in order to reveal systematic mechanisms underlying the generalization abilities of DL models. Our analysis is performed across varying network and optimizer hyper-parameters and involves a rich family of different sharpness measures. We compare these measures and show that the low-pass-filter-based measure exhibits the highest correlation with the generalization abilities of DL models, is highly robust to both data and label noise, and can furthermore track the double-descent behavior of neural networks. We next derive an optimization algorithm, based on the low-pass filter (LPF), that actively searches for flat regions of the DL optimization landscape using an SGD-like procedure. The update of the proposed algorithm, which we call LPF-SGD, is determined by the gradient of the convolution of the filter kernel with the loss function and can be efficiently computed using MC sampling. We empirically show that our algorithm achieves superior generalization performance compared to common DL training strategies. On the theoretical front, we prove that LPF-SGD converges to a better optimum, with a smaller generalization error, than SGD.
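
To make the update rule in the abstract concrete, below is a minimal PyTorch sketch of one such step. This is an illustration, not the paper's exact algorithm: it assumes an isotropic Gaussian kernel K = N(0, σ²I) (the abstract does not specify the kernel), so the smoothed gradient ∇(K * L)(w) is approximated by averaging ∇L(w + ε_i) over M perturbations ε_i ~ N(0, σ²I). The function name `lpf_sgd_step` and the default hyper-parameters are placeholders.

```python
import torch

def lpf_sgd_step(model, loss_fn, data, target, lr=0.1, sigma=0.01, mc_samples=8):
    """One illustrative LPF-SGD-style step (isotropic-Gaussian sketch).

    Estimates the gradient of the smoothed loss (K * L)(w), with
    K = N(0, sigma^2 I), by Monte Carlo:
        grad ~= (1/M) * sum_i grad L(w + eps_i),  eps_i ~ N(0, sigma^2 I).
    """
    params = [p for p in model.parameters() if p.requires_grad]
    grad_acc = [torch.zeros_like(p) for p in params]

    for _ in range(mc_samples):
        # Perturb the weights with Gaussian noise drawn from the kernel.
        noise = [sigma * torch.randn_like(p) for p in params]
        with torch.no_grad():
            for p, n in zip(params, noise):
                p.add_(n)

        # Gradient of the loss at the perturbed weights.
        model.zero_grad()
        loss_fn(model(data), target).backward()
        for g, p in zip(grad_acc, params):
            g.add_(p.grad.detach())

        # Remove the perturbation before drawing the next sample.
        with torch.no_grad():
            for p, n in zip(params, noise):
                p.sub_(n)

    # Plain SGD step along the MC estimate of the smoothed gradient.
    with torch.no_grad():
        for p, g in zip(params, grad_acc):
            p.sub_(lr * g / mc_samples)
```

Calling this once per mini-batch in place of a standard optimizer step gives the SGD-like procedure described above; the kernel shape and sampling schedule used in the paper may differ from this simplified version.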
