Paper Title
On the Benefits of Large Learning Rates for Kernel Methods
Paper Authors
Paper Abstract
This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. We show that this phenomenon, first observed in the deep learning literature, can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space and show that, with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian's eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates may be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why this already occurs in classification tasks, without assuming any particular mismatch between train and test data distributions.
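
As an illustrative aside (not taken from the paper), the spectral effect mentioned in the abstract can be reproduced numerically: running gradient descent from zero on a quadratic objective f(w) = 0.5 w'Hw - b'w, the coefficient of the iterate along an eigenvector of H with eigenvalue lambda equals (1 - (1 - eta*lambda)^t) times the corresponding coefficient of the minimizer, so the learning rate eta and the stopping time t jointly determine which spectral components are fitted. The sketch below is a minimal NumPy check of this closed form; all dimensions, names, and values are arbitrary assumptions, not the paper's experimental setup.

```python
# Minimal sketch (illustrative assumption, not the paper's code): gradient
# descent on a quadratic objective f(w) = 0.5 * w @ H @ w - b @ w, showing
# that after t steps from w = 0 the coefficient along each eigenvector of H
# follows the spectral filter 1 - (1 - eta * lambda)**t.
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
H = A @ A.T / d                          # symmetric positive-definite Hessian
b = rng.standard_normal(d)
eigvals, eigvecs = np.linalg.eigh(H)

def gd_iterate(eta, t):
    """Run t steps of gradient descent from w = 0 with step size eta."""
    w = np.zeros(d)
    for _ in range(t):
        w = w - eta * (H @ w - b)
    return w

eta, t = 1.0 / eigvals.max(), 50         # step size scaled to the largest eigenvalue

w_t = gd_iterate(eta, t)

# Coefficients of the iterate on the Hessian's eigenvectors, compared with
# the closed form (1 - (1 - eta * lambda)^t) * <b, v> / lambda.
empirical = eigvecs.T @ w_t
closed_form = (1 - (1 - eta * eigvals) ** t) * (eigvecs.T @ b) / eigvals
print(np.allclose(empirical, closed_form))  # True
```

Varying eta and t in this sketch shows the interplay the abstract refers to: components with large eigenvalues are fitted almost immediately, while small-eigenvalue components are recovered only for larger eta or longer training.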