Paper Title
Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
Paper Authors
Paper Abstract
The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reduce the memory pressure, it requires significant modifications to the source code and careful algorithmic consideration. An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism. We propose a performance model based on the concurrency analysis of out-of-core training behavior, and derive a strategy that combines layer swapping and redundant recomputing. We achieve an average of 1.52x speedup on six different models over the state-of-the-art out-of-core methods. We also introduce the first method to solve the challenging problem of out-of-core multi-node training by carefully pipelining gradient exchanges and performing the parameter updates on the host. Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
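The combination of layer swapping and redundant recomputation mentioned in the abstract can be illustrated with a minimal PyTorch sketch. This is not the KARMA implementation: the alternating swap/recompute schedule, the layer shapes, and the names `SwapOrRecomputeBlock` and `Net` are hypothetical, and a real schedule would be derived from the paper's performance model rather than fixed by hand.

```python
# Minimal sketch: per block, either swap saved activations to host memory
# or discard them and recompute the forward pass during backward.
import torch
import torch.nn as nn
from torch.autograd.graph import save_on_cpu
from torch.utils.checkpoint import checkpoint


class SwapOrRecomputeBlock(nn.Module):
    """One block whose activations are either offloaded to (pinned) host
    memory ("swap") or recomputed during the backward pass ("recompute")."""

    def __init__(self, hidden, mode):
        super().__init__()
        assert mode in ("swap", "recompute")
        self.mode = mode
        self.body = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )

    def forward(self, x):
        if self.mode == "swap":
            # Saved activations are packed to CPU during forward and copied
            # back to the accelerator when backward needs them.
            with save_on_cpu(pin_memory=True):
                return self.body(x)
        # Redundant recomputation: keep only the block input and re-run the
        # forward computation in backward to regenerate activations.
        return checkpoint(self.body, x, use_reentrant=False)


class Net(nn.Module):
    def __init__(self, hidden=1024, n_blocks=8):
        super().__init__()
        # Hypothetical schedule: alternate swapping and recomputing blocks.
        self.blocks = nn.ModuleList(
            [SwapOrRecomputeBlock(hidden, "swap" if i % 2 == 0 else "recompute")
             for i in range(n_blocks)]
        )
        self.head = nn.Linear(hidden, 10)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.head(x)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = Net().to(device)
    x = torch.randn(4, 1024, device=device)
    loss = model(x).sum()
    loss.backward()  # triggers copy-back (swap) or recompute per block
```

Here `save_on_cpu(pin_memory=True)` stands in for layer swapping and `checkpoint` for redundant recomputation; choosing between the two per layer is the trade-off the abstract's performance model captures. The multi-node part of the approach, pipelining gradient exchanges and updating parameters on the host, is not shown in this sketch.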