Paper Title

Adaptive Periodic Averaging: A Practical Approach to Reducing Communication in Distributed Learning

Authors

Peng Jiang, Gagan Agrawal

Abstract

Stochastic Gradient Descent (SGD) is the key learning algorithm for many machine learning tasks. Because of its computational cost, there is growing interest in accelerating SGD on HPC resources such as GPU clusters. However, the performance of parallel SGD is still bottlenecked by high communication costs, even with fast connections among the machines. A simple approach to alleviating this problem, used in many existing efforts, is to perform communication only every few iterations, using a constant averaging period. In this paper, we show that the optimal averaging period in terms of convergence and communication cost is not a constant, but instead varies over the course of the execution. Specifically, we observe that reducing the variance of the model parameters among the computing nodes is critical to the convergence of periodic parameter averaging SGD. Given a fixed communication budget, we show that it is more beneficial to synchronize more frequently in early iterations, to reduce the initially large variance, and less frequently in the later phase of the training process. We propose a practical algorithm, named ADaptive Periodic parameter averaging SGD (ADPSGD), to achieve a smaller overall variance of the model parameters, and thus better convergence than Constant Periodic parameter averaging SGD (CPSGD). We evaluate our method on several image classification benchmarks and show that ADPSGD indeed achieves smaller training losses and higher test accuracies with less communication than CPSGD. Compared with gradient-quantization SGD, our algorithm achieves faster convergence with only half of the communication. Compared with full-communication SGD, ADPSGD achieves 1.14x to 1.27x speedups with a 100Gbps connection among the computing nodes, and the speedups increase to 1.46x to 1.95x with a 10Gbps connection.
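To make the idea concrete, below is a minimal PyTorch sketch of periodic parameter averaging SGD with an averaging period that starts small and grows during training. It assumes torch.distributed has already been initialized on every worker; the helper names (average_parameters, adaptive_periodic_sgd) and the multiplicative schedule controlled by initial_period, growth, and max_period are illustrative assumptions, not the paper's actual period-selection rule.

```python
# Minimal sketch of periodic parameter averaging SGD with a growing
# averaging period. Assumes torch.distributed is already initialized
# (e.g. NCCL backend) and each worker holds its own replica of `model`.
# The schedule (initial_period, growth, max_period) is a hypothetical
# illustration, not the paper's exact rule for choosing the period.
import torch.distributed as dist


def average_parameters(model):
    """All-reduce model parameters and divide by the number of workers."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        p.data.div_(world_size)


def adaptive_periodic_sgd(model, optimizer, loss_fn, data_loader, epochs,
                          initial_period=1, growth=1.05, max_period=64):
    period = float(initial_period)   # current averaging period (iterations)
    local_steps = 0                  # local iterations since the last sync
    for _ in range(epochs):
        for x, y in data_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()         # purely local SGD update, no communication
            local_steps += 1
            if local_steps >= int(period):
                average_parameters(model)   # one communication round
                local_steps = 0
                # Early iterations synchronize often; as training proceeds the
                # period grows, so communication becomes progressively sparser.
                period = min(period * growth, max_period)
```

Setting growth to 1.0 recovers constant periodic averaging (the CPSGD baseline in the abstract), while growth > 1.0 concentrates communication in the early iterations, where the parameter variance across workers is largest.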
