Paper Title

ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!

Paper Authors

Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, Peter Richtárik

Paper Abstract

We introduce ProxSkip -- a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth proximable ($\psi$) function. The canonical approach to solving such problems is via the proximal gradient descent (ProxGD) algorithm, which is based on the evaluation of the gradient of $f$ and the prox operator of $\psi$ in each iteration. In this work we are specifically interested in the regime in which the evaluation of prox is costly relative to the evaluation of the gradient, which is the case in many applications. ProxSkip allows for the expensive prox operator to be skipped in most iterations: while its iteration complexity is $\mathcal{O}\left(\kappa \log \frac{1}{\varepsilon}\right)$, where $\kappa$ is the condition number of $f$, the number of prox evaluations is $\mathcal{O}\left(\sqrt{\kappa} \log \frac{1}{\varepsilon}\right)$ only. Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently on all devices, and evaluation of prox corresponds to (expensive) communication in the form of gradient averaging. In this context, ProxSkip offers an effective acceleration of communication complexity. Unlike other local gradient-type methods, such as FedAvg, SCAFFOLD, S-Local-GD and FedLin, whose theoretical communication complexity is worse than, or at best matching, that of vanilla GD in the heterogeneous data regime, we obtain a provable and large improvement without any heterogeneity-bounding assumptions.
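
To make the prox-skipping idea concrete, below is a minimal Python sketch of one possible iteration of this kind: a shifted gradient step on $f$ using a control variate $h$, with the prox of $\psi$ evaluated only with small probability $p$ (and a larger step size $\gamma/p$ when it is evaluated). The names `grad_f` and `prox_psi` and the loop structure are illustrative assumptions based on the abstract, not the authors' reference implementation; choosing $p \approx 1/\sqrt{\kappa}$ is the setting consistent with the $\mathcal{O}\left(\sqrt{\kappa} \log \frac{1}{\varepsilon}\right)$ prox complexity stated above.

```python
import numpy as np

def proxskip_sketch(grad_f, prox_psi, x0, gamma, p, num_iters, seed=0):
    """Illustrative prox-skipping loop (a sketch, not the authors' reference code).

    grad_f(x)         -- gradient of the smooth part f
    prox_psi(x, step) -- proximal operator of psi with step size `step`
    gamma             -- step size for the gradient step
    p                 -- probability of evaluating the prox (e.g. ~ 1/sqrt(kappa))
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    h = np.zeros_like(x)  # control variate correcting for the skipped prox steps

    for _ in range(num_iters):
        # Cheap step taken every iteration: shifted gradient step on f.
        x_hat = x - gamma * (grad_f(x) - h)

        if rng.random() < p:
            # Rare, expensive step: prox of psi with the larger step gamma / p.
            x_new = prox_psi(x_hat - (gamma / p) * h, gamma / p)
        else:
            # Most iterations: skip the prox entirely.
            x_new = x_hat

        # Update the control variate; it only changes on iterations where the prox fires.
        h = h + (p / gamma) * (x_new - x_hat)
        x = x_new

    return x
```

In the federated-learning reading of the abstract, $\psi$ encodes the consensus constraint across devices: its prox is the averaging (communication) step, `grad_f` corresponds to a local GD step taken independently on each device, and every skipped prox is a communication round saved.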
