Paper Title


Sharper Rates and Flexible Framework for Nonconvex SGD with Client and Data Sampling

Paper Authors

Alexander Tyurin, Lukang Sun, Konstantin Burlachenko, Peter Richtárik

Abstract


We revisit the classical problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions. The optimal complexity of stochastic first-order methods in terms of the number of gradient evaluations of individual functions is $\mathcal{O}\left(n + n^{1/2}\varepsilon^{-1}\right)$, attained, for example, by the optimal SGD methods $\small\sf\color{green}{SPIDER}$ (arXiv:1807.01695) and $\small\sf\color{green}{PAGE}$ (arXiv:2008.10898), where $\varepsilon$ is the error tolerance. However, i) the big-$\mathcal{O}$ notation hides crucial dependencies on the smoothness constants associated with the functions, and ii) the rates and theory of these methods assume simplistic sampling mechanisms that do not offer any flexibility. In this work, we remedy the situation. First, we generalize the $\small\sf\color{green}{PAGE}$ algorithm so that it can provably work with virtually any (unbiased) sampling mechanism. This is particularly useful in federated learning, as it allows us to construct and better understand the impact of various combinations of client and data sampling strategies. Second, our analysis is sharper, as we make explicit use of certain novel inequalities that capture the intricate interplay between the smoothness constants and the sampling procedure. Indeed, our analysis is better even for the simple sampling procedure analyzed in the $\small\sf\color{green}{PAGE}$ paper. Moreover, this already improved bound can be further sharpened by a different sampling scheme which we propose. In summary, we provide the most general and most accurate analysis of optimal SGD in the smooth nonconvex regime. Finally, our theoretical findings are supported by carefully designed experiments.
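To make the abstract's reference to $\small\sf\color{green}{PAGE}$ concrete, the following is a minimal sketch of the basic PAGE-style loop from the cited paper (arXiv:2008.10898), not the generalized sampling framework this work proposes: with probability $p$ the gradient estimator is refreshed with a full gradient, and otherwise it is updated with a minibatch gradient difference. The toy least-squares objective, and all parameter values (`gamma`, `p`, `batch`, `iters`), are illustrative assumptions, not from the paper.

```python
import numpy as np

# Toy smooth finite-sum problem: f(x) = (1/n) sum_i (A_i @ x - b_i)^2 / 2
rng = np.random.default_rng(0)
n, d = 20, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_i(x, i):
    # Gradient of the i-th component f_i
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    # Gradient of the average f
    return (A @ x - b) @ A / n

def page(x0, gamma=0.1, p=0.5, batch=4, iters=1000):
    """PAGE-style loop: with probability p take a full gradient step,
    otherwise update the estimator g with a minibatch difference."""
    x = x0.copy()
    g = full_grad(x)
    for _ in range(iters):
        x_new = x - gamma * g
        if rng.random() < p:
            g = full_grad(x_new)
        else:
            idx = rng.choice(n, size=batch, replace=False)
            g = g + np.mean(
                [grad_i(x_new, i) - grad_i(x, i) for i in idx], axis=0
            )
        x = x_new
    return x

x_out = page(np.zeros(d))
print(np.linalg.norm(full_grad(x_out)))  # gradient norm: small at a stationary point
```

The generalization described in the abstract replaces the uniform minibatch `rng.choice` line with an essentially arbitrary unbiased sampling rule, e.g., a composition of client and data sampling in federated learning.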
