Title
Slow and Stale Gradients Can Win the Race
Authors
Abstract
Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in runtime as it waits for the slowest workers (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect the convergence error. In this work, we present a novel theoretical characterization of the speedup offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wall-clock time). The main novelty in our work is that our runtime analysis considers random straggling delays, which helps us design and compare distributed SGD algorithms that strike a balance between straggling and staleness. We also provide a new error convergence analysis of asynchronous SGD variants without bounded or exponential delay assumptions. Finally, based on our theoretical characterization of the error-runtime trade-off, we propose a method of gradually varying synchronicity in distributed SGD and demonstrate its performance on the CIFAR-10 dataset.
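To make the straggling-versus-staleness trade-off described in the abstract concrete, below is a minimal simulation sketch (Python/NumPy, not the authors' code) of a K-sync style SGD variant: the parameter server waits only for the K fastest of P workers in each iteration, so K = P recovers fully synchronous SGD and smaller K reduces waiting for stragglers at the cost of averaging fewer gradients. The toy least-squares objective, the exponential straggler model, the function name k_sync_sgd, and all hyperparameters are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch: simulated K-sync SGD on a toy least-squares problem.
# Worker compute times are drawn from an exponential distribution, a common
# straggler model; all numbers below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: minimize (1/2n) * ||A w - b||^2
n, d = 1000, 20
A = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
b = A @ w_true + 0.1 * rng.normal(size=n)

def stochastic_gradient(w, batch_size=32):
    # Mini-batch gradient of the least-squares objective.
    idx = rng.integers(0, n, size=batch_size)
    return A[idx].T @ (A[idx] @ w - b[idx]) / batch_size

def k_sync_sgd(num_workers=8, k=4, iters=500, lr=0.05, mean_delay=1.0):
    """Run simulated K-sync SGD; return (final training error, wall-clock time)."""
    w = np.zeros(d)
    wallclock = 0.0
    for _ in range(iters):
        # Each worker's compute time this iteration (exponential stragglers).
        delays = rng.exponential(mean_delay, size=num_workers)
        # The server waits only for the K fastest workers.
        wallclock += np.sort(delays)[k - 1]
        # Average the K gradients that arrived, all computed at the current w.
        grad = np.mean([stochastic_gradient(w) for _ in range(k)], axis=0)
        w -= lr * grad
    error = 0.5 * np.mean((A @ w - b) ** 2)
    return error, wallclock

# Compare fully synchronous SGD (K = P) with partially synchronous variants.
for k in (8, 4, 1):
    err, t = k_sync_sgd(k=k)
    print(f"K={k}: error={err:.4f}, simulated wall-clock time={t:.1f}")
```

In this sketch, the "gradually varying synchronicity" mentioned in the abstract would correspond to starting with a small K and increasing it toward the total number of workers as training proceeds, trading fast early progress against lower final error.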