Paper Title

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Paper Authors

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala

Paper Abstract

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel training, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
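The sketch below, which is not taken from the paper, illustrates how the three techniques named in the abstract surface in PyTorch's public DDP API: gradient bucketing via the bucket_cap_mb argument, computation/communication overlap handled automatically by DDP's backward hooks, and skipping gradient synchronization via the no_sync() context manager. The toy linear model, random data, accumulation interval, and launch via torchrun are illustrative assumptions.

```python
# Minimal sketch of DDP usage, assuming a single node with multiple GPUs and a
# launch command such as `torchrun --nproc_per_node=<num_gpus> script.py`,
# which sets RANK / LOCAL_RANK / WORLD_SIZE in the environment.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    ddp_model = DDP(
        model,
        device_ids=[local_rank],
        bucket_cap_mb=25,  # gradient bucketing: fuse gradients into ~25 MB buckets
    )
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    accumulation_steps = 4  # illustrative: synchronize only every 4th iteration

    for step in range(100):
        inputs = torch.randn(32, 1024, device=local_rank)   # placeholder batch
        targets = torch.randn(32, 1024, device=local_rank)

        if (step + 1) % accumulation_steps != 0:
            # Skip gradient synchronization: gradients accumulate locally and
            # no allreduce is issued during this backward pass.
            with ddp_model.no_sync():
                loss_fn(ddp_model(inputs), targets).backward()
        else:
            # Normal iteration: DDP's autograd hooks launch an allreduce on each
            # bucket as soon as its gradients are ready, overlapping that
            # communication with the remaining backward computation.
            loss_fn(ddp_model(inputs), targets).backward()
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```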
