Paper Title
Efficient Pipeline Planning for Expedited Distributed DNN Training
Paper Authors
Paper Abstract
To train modern large DNN models, pipeline parallelism has recently emerged: it distributes the model across GPUs and enables different devices to process different microbatches in a pipelined fashion. Earlier pipeline designs allow multiple versions of model parameters to co-exist (similar to asynchronous training) and cannot ensure the same model convergence and accuracy as training without pipelining. Synchronous pipelining, proposed more recently, preserves model performance by enforcing a synchronization barrier between training iterations. Nonetheless, the synchronization barrier requires waiting for gradient aggregation over all microbatches and thus delays training progress. Optimized pipeline planning is needed to minimize such waiting and hence the training time, but it has not been well studied in the literature. This paper designs efficient, near-optimal algorithms for expediting synchronous pipeline-parallel training of modern large DNNs over arbitrary inter-GPU connectivity. Our algorithmic framework comprises two components: a pipeline partition and device mapping algorithm, and a pipeline scheduler that decides the processing order of microbatches over the partitions; together they minimize the per-iteration training time. We conduct thorough theoretical analysis, extensive testbed experiments, and trace-driven simulations, and demonstrate that our scheme can accelerate training by up to 157% compared with state-of-the-art designs.
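To illustrate why the synchronization barrier matters, the following minimal sketch simulates a generic GPipe-style flush schedule and reports when the last microbatch's gradients become available. It is not the paper's partitioning or scheduling algorithm; the stage times and microbatch count are made-up illustrative values, and inter-stage communication delays are ignored.

```python
# Minimal sketch (assumed GPipe-style flush schedule, not the paper's planner):
# estimate the per-iteration time of a synchronous pipeline, i.e. the moment
# the synchronization barrier can fire after all microbatches' backward passes.

def iteration_time(fwd, bwd, num_microbatches):
    """fwd[s], bwd[s]: forward/backward compute time of pipeline stage s.
    All forward passes run before any backward pass (pipeline flush); the
    barrier fires once the last backward pass on stage 0 completes."""
    S = len(fwd)

    # f_done[s][m]: completion time of the forward pass of microbatch m on stage s.
    f_done = [[0.0] * num_microbatches for _ in range(S)]
    for m in range(num_microbatches):
        for s in range(S):
            ready = f_done[s - 1][m] if s > 0 else 0.0   # activations from previous stage
            free = f_done[s][m - 1] if m > 0 else 0.0    # stage busy with previous microbatch
            f_done[s][m] = max(ready, free) + fwd[s]

    # Backward passes flow in reverse stage order after all forwards finish.
    all_fwd = f_done[S - 1][num_microbatches - 1]
    b_done = [[0.0] * num_microbatches for _ in range(S)]
    for m in range(num_microbatches):
        for s in reversed(range(S)):
            ready = b_done[s + 1][m] if s < S - 1 else all_fwd
            free = b_done[s][m - 1] if m > 0 else 0.0
            b_done[s][m] = max(ready, free) + bwd[s]

    # Gradient aggregation waits for every microbatch; stage 0 finishes last.
    return b_done[0][num_microbatches - 1]


if __name__ == "__main__":
    # Hypothetical 4-stage partition; backward roughly 2x the forward cost.
    fwd = [1.0, 1.2, 1.1, 0.9]
    bwd = [2.0, 2.4, 2.2, 1.8]
    print(iteration_time(fwd, bwd, num_microbatches=8))
```

In this toy model, changing the partition (the fwd/bwd profiles across stages) or the order in which microbatches are processed shifts when that final backward pass completes, which is precisely the per-iteration time that the paper's partition, mapping, and scheduling algorithms aim to minimize.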