Paper Title

Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence

Paper Authors

Nicolas Loizou, Sharan Vaswani, Issam Laradji, Simon Lacoste-Julien

Paper Abstract

We propose a stochastic variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method. Although computing the Polyak step-size requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex and non-convex functions. Furthermore, our analysis results in novel convergence guarantees for SGD with a constant step-size. We show that SPS is particularly effective when training over-parameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring the knowledge of any problem-dependent constants or additional computational overhead. We experimentally validate our theoretical results via extensive experiments on synthetic and real datasets. We demonstrate the strong performance of SGD with SPS compared to state-of-the-art optimization methods when training over-parameterized models.
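To make the idea concrete, below is a minimal illustrative sketch (not the authors' reference implementation) of SGD driven by a stochastic Polyak-style step-size on a synthetic least-squares problem. The scaling constant c, the cap gamma_max, and the use of f_i^* = 0 (justified here by constructing an interpolating problem) are assumptions chosen for illustration, not values fixed by the abstract.

```python
import numpy as np

# Minimal SGD-with-SPS sketch on a least-squares problem:
#   f_i(x) = 0.5 * (a_i^T x - b_i)^2, with b = A x_true (interpolation),
# so each per-sample optimum f_i^* = 0 (an assumption, matching the
# over-parameterized/interpolation setting described in the abstract).

rng = np.random.default_rng(0)
n, d = 200, 50
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true  # interpolation: zero loss is attainable

def loss_i(x, i):
    r = A[i] @ x - b[i]
    return 0.5 * r ** 2

def grad_i(x, i):
    r = A[i] @ x - b[i]
    return r * A[i]

def sgd_sps(x0, iters=5000, c=0.5, gamma_max=10.0, f_star=0.0, eps=1e-12):
    """SGD with a stochastic Polyak-style step-size.

    Step-size: gamma_k = min((f_i(x_k) - f_i^*) / (c * ||g_i(x_k)||^2), gamma_max).
    Here c, gamma_max, and f_star = 0 are illustrative choices.
    """
    x = x0.copy()
    for _ in range(iters):
        i = rng.integers(n)                    # sample one data point
        g = grad_i(x, i)
        gamma = min((loss_i(x, i) - f_star) / (c * (g @ g) + eps), gamma_max)
        x -= gamma * g
    return x

x_hat = sgd_sps(np.zeros(d))
print("final average loss:", 0.5 * np.mean((A @ x_hat - b) ** 2))
```

The cap gamma_max keeps the step bounded when the stochastic gradient is very small relative to the suboptimality gap; the only per-iteration overhead beyond plain SGD is the per-sample loss value already computed in the forward pass.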
