Title
Better Theory for SGD in the Nonconvex World
Authors

Abstract
Large-scale nonconvex optimization problems are ubiquitous in modern machine learning, and among practitioners interested in solving them, Stochastic Gradient Descent (SGD) reigns supreme. We revisit the analysis of SGD in the nonconvex setting and propose a new variant of the recently introduced expected smoothness assumption which governs the behaviour of the second moment of the stochastic gradient. We show that our assumption is both more general and more reasonable than assumptions made in all prior work. Moreover, our results yield the optimal $\mathcal{O}(\varepsilon^{-4})$ rate for finding a stationary point of nonconvex smooth functions, and recover the optimal $\mathcal{O}(\varepsilon^{-1})$ rate for finding a global solution if the Polyak-Łojasiewicz condition is satisfied. We compare against convergence rates under convexity and prove a theorem on the convergence of SGD under Quadratic Functional Growth and convexity, which might be of independent interest. Moreover, we perform our analysis in a framework which allows for a detailed study of the effects of a wide array of sampling strategies and minibatch sizes for finite-sum optimization problems. We corroborate our theoretical results with experiments on real and synthetic data.
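For concreteness, here is a hedged sketch of the two conditions the abstract refers to; the constants $A, B, C \ge 0$, the lower bound $f^{\inf}$, and the PL constant $\mu > 0$ are illustrative notation rather than quotations from the paper. Expected-smoothness-type assumptions bound the second moment of the stochastic gradient $g(x)$ as
$$\mathbb{E}\big[\|g(x)\|^2\big] \;\le\; 2A\big(f(x) - f^{\inf}\big) + B\,\|\nabla f(x)\|^2 + C \quad \text{for all } x,$$
and the Polyak-Łojasiewicz condition, under which the $\mathcal{O}(\varepsilon^{-1})$ rate to a global solution is recovered, is commonly stated as
$$\tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\big(f(x) - f^{*}\big) \quad \text{for all } x,$$
where $f^{*}$ denotes the global minimum value of $f$.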