Paper Title

Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning

Authors

Swanand Kadhe, O. Ozan Koyluoglu, Kannan Ramchandran

Abstract

Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations: delays caused by slow-running machines, called 'stragglers', and communication overheads. Recently, Ye and Abbe [ICML 2018] proposed a coding-theoretic paradigm to characterize a fundamental trade-off between computation load per worker, communication overhead per worker, and straggler tolerance. However, their proposed coding schemes suffer from heavy decoding complexity and poor numerical stability. In this paper, we develop a communication-efficient gradient coding framework to overcome these drawbacks. Our proposed framework enables using any linear code to design the encoding and decoding functions. When a particular code is used in this framework, its block length determines the computation load, its dimension determines the communication overhead, and its minimum distance determines the straggler tolerance. The flexibility of choosing a code allows us to gracefully trade off the straggler threshold and communication overhead for smaller decoding complexity and higher numerical stability. Further, we show that using a maximum distance separable (MDS) code generated by a random Gaussian matrix in our framework yields a gradient code that is optimal with respect to the trade-off and, in addition, satisfies stronger guarantees on numerical stability than the previously proposed schemes. Finally, we evaluate our proposed framework on Amazon EC2 and demonstrate that it reduces the average iteration time by 16% as compared to prior gradient coding schemes.
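As a concrete illustration of the coding idea in the abstract, below is a minimal Python sketch of straggler-tolerant gradient aggregation using a random Gaussian encoding matrix, which is MDS with probability 1. All names (k, n_workers, G, survivors) are hypothetical, and the construction is deliberately simplified; it sketches the general technique, not the paper's exact scheme.

```python
import numpy as np

# Hypothetical, simplified sketch of straggler-tolerant gradient aggregation
# with a random Gaussian encoding matrix (MDS with probability 1). This
# illustrates the general coding idea, not the paper's exact construction.

rng = np.random.default_rng(0)

k = 4          # number of gradient partitions (code dimension)
n_workers = 6  # number of workers (block length); any k responses suffice
dim = 5        # dimension of each partial gradient

# Random Gaussian generator matrix: every k x k submatrix is invertible a.s.
G = rng.standard_normal((n_workers, k))

# Toy partial gradients g_1, ..., g_k, one per data partition.
partial_grads = rng.standard_normal((k, dim))

# Worker i transmits the coded combination sum_j G[i, j] * g_j.
coded = G @ partial_grads  # shape: (n_workers, dim)

# Pretend only these k workers respond; the other n_workers - k straggle.
survivors = [0, 2, 3, 5]

# Decode: invert the surviving k x k submatrix to recover the partial
# gradients, then sum them to obtain the full gradient.
recovered = np.linalg.solve(G[survivors, :], coded[survivors, :])
full_gradient = recovered.sum(axis=0)

assert np.allclose(full_gradient, partial_grads.sum(axis=0))
```

In this toy version each worker combines all k partial gradients and sends a full-dimensional vector; the paper's framework additionally reduces per-worker communication (governed by the code dimension) and controls the computation load (governed by the block length), which the sketch does not model.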
