Paper Title
A Second look at Exponential and Cosine Step Sizes: Simplicity, Adaptivity, and Performance
Paper Authors
Paper Abstract
Stochastic Gradient Descent (SGD) is a popular tool for training large-scale machine learning models. Its performance, however, is highly variable, depending crucially on the choice of the step sizes. Accordingly, a variety of strategies for tuning the step sizes have been proposed, ranging from coordinate-wise approaches (a.k.a. ``adaptive'' step sizes) to sophisticated heuristics that change the step size in each iteration. In this paper, we study two step size schedules whose power has been repeatedly confirmed in practice: the exponential and the cosine step sizes. For the first time, we provide theoretical support for them, proving convergence rates for smooth non-convex functions, with and without the Polyak-Łojasiewicz (PL) condition. Moreover, we show the surprising property that these two strategies are \emph{adaptive} to the noise level in the stochastic gradients of PL functions. That is, contrary to polynomial step sizes, they achieve almost optimal performance without needing to know the noise level or to tune their hyperparameters based on it. Finally, we conduct a fair and comprehensive empirical evaluation on real-world datasets with deep learning architectures. The results show that, even though they require at most two hyperparameters to tune, these two strategies match or beat the performance of various finely tuned state-of-the-art strategies.
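For reference, a minimal sketch of the two schedules, assuming the commonly used parameterizations with initial step size $\eta_0$, iteration horizon $T$, and exponential decay factor $\alpha \in (0,1)$; the paper's exact constants may differ:
\[
\eta_t^{\mathrm{exp}} = \eta_0\,\alpha^{t}, \qquad
\eta_t^{\mathrm{cos}} = \frac{\eta_0}{2}\left(1 + \cos\frac{\pi t}{T}\right), \qquad t = 0, 1, \dots, T.
\]
One common choice for the exponential schedule is $\alpha = (\beta/T)^{1/T}$ for a constant $\beta > 0$, which gives $\eta_T = \eta_0\,\beta/T$; under these parameterizations each schedule is governed by at most the two hyperparameters ($\eta_0$ and $\alpha$, or $\eta_0$ and $T$) referred to in the abstract.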