Paper Title

Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees

Authors

Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang

Abstract

Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks. Despite recent intensive studies of gradient compression for data-parallel-style training, compressing the activations for models trained with pipeline parallelism is still an open problem. In this paper, we propose AC-SGD, a novel activation compression algorithm for communication-efficient pipeline-parallel training over slow networks. Different from previous efforts in activation compression, instead of compressing activation values directly, AC-SGD compresses the changes of the activations. This allows us to show, to the best of our knowledge for the first time, that one can still achieve an $O(1/\sqrt{T})$ convergence rate for non-convex objectives under activation compression, without making assumptions on gradient unbiasedness that do not hold for deep learning models with non-linear activation functions. We then show that AC-SGD can be optimized and implemented efficiently, without additional end-to-end runtime overhead. We evaluated AC-SGD by fine-tuning language models with up to 1.5 billion parameters, compressing activations to 2-4 bits. AC-SGD provides up to 4.3X end-to-end speed-up in slower networks, without sacrificing model quality. Moreover, we also show that AC-SGD can be combined with state-of-the-art gradient compression algorithms to enable "end-to-end communication compression": all communications between machines, including model gradients, forward activations, and backward gradients, are compressed into lower precision. This provides up to 4.9X end-to-end speed-up, without sacrificing model quality.
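To make the abstract's key idea concrete, below is a minimal sketch of compressing the change of activations between pipeline stages rather than the raw activation values. It is not the authors' implementation: the helper names (`quantize_uniform`, `ActivationDeltaCompressor`) and the simple per-tensor uniform quantizer are assumptions for illustration only.

```python
# Minimal sketch (assumed names, not the authors' code) of activation-change
# compression between two pipeline stages: keep a reference activation per
# micro-batch, quantize only the delta to 2-4 bits, and send the small payload.
import torch


def quantize_uniform(x: torch.Tensor, bits: int = 2):
    """Uniformly quantize x to `bits` bits; return integer codes plus (scale, offset)."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / levels
    codes = torch.round((x - lo) / scale).to(torch.uint8)  # low-precision payload
    return codes, scale, lo


def dequantize_uniform(codes, scale, lo):
    return codes.float() * scale + lo


class ActivationDeltaCompressor:
    """Keeps a reference activation per micro-batch and compresses the deltas."""

    def __init__(self, bits: int = 2):
        self.bits = bits
        self.reference = {}  # micro-batch id -> last reconstructed activation

    def compress(self, microbatch_id: int, activation: torch.Tensor):
        ref = self.reference.get(microbatch_id, torch.zeros_like(activation))
        codes, scale, lo = quantize_uniform(activation - ref, self.bits)
        # Update the reference with the *reconstructed* activation so that the
        # sender stays in sync with what the receiver will reconstruct.
        self.reference[microbatch_id] = ref + dequantize_uniform(codes, scale, lo)
        return codes, scale, lo  # this small message crosses the slow link

    def decompress(self, microbatch_id: int, codes, scale, lo):
        ref = self.reference.get(microbatch_id)
        if ref is None:
            ref = torch.zeros(codes.shape)
        recon = ref + dequantize_uniform(codes, scale, lo)
        self.reference[microbatch_id] = recon
        return recon
```

In this sketch, both sides advance their reference using the reconstructed (not the exact) activation, so sender and receiver remain synchronized despite quantization error; the low-bit message then encodes only how the activation changed since the last step, which is the property the abstract attributes to AC-SGD.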
