Title
On Biased Compression for Distributed Learning
Authors
Abstract
In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings. We prove that the distributed compressed SGD method, employed with an error feedback mechanism, enjoys the ergodic rate $O\left( \delta L \exp\left[-\frac{\mu K}{\delta L}\right] + \frac{C + \delta D}{K\mu}\right)$, where $\delta \ge 1$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node), and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.
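To make the two central ingredients of the abstract concrete, the sketch below pairs a classic biased compression operator (Top-k sparsification, which keeps only the largest-magnitude entries and is therefore not unbiased in expectation) with the error feedback mechanism: the part of the update that compression discards is remembered and re-injected on the next step. This is a minimal single-node illustration under assumed toy parameters (`lr`, `k`, the quadratic objective), not the paper's exact distributed algorithm.

```python
import numpy as np

def top_k(v, k):
    """Top-k sparsifier: keep the k largest-magnitude entries, zero the rest.
    A biased compressor: E[top_k(v)] != v in general."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef_step(x, e, grad, lr, k):
    """One error-feedback step: compress the error-corrected update,
    apply only the transmitted part, and keep the remainder as new error."""
    p = e + lr * grad(x)   # fold in the residual left over from earlier steps
    delta = top_k(p, k)    # biased compression of the corrected update
    return x - delta, p - delta

# Toy strongly convex objective f(x) = 0.5 * ||x||^2, so grad(x) = x.
x = np.array([4.0, -3.0, 2.0, -1.0])
e = np.zeros_like(x)
for _ in range(1000):
    x, e = ef_step(x, e, lambda z: z, lr=0.1, k=1)
# x approaches the optimum (the zero vector) despite sending only
# one coordinate per step, illustrating the linear-convergence claim.
```

Without the error term `e`, plain Top-k compressed gradient descent can stall or cycle; error feedback is what lets the biased operator retain convergence to the exact optimum in the deterministic, strongly convex setting.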