Paper Title

Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees

Authors

Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang

Abstract

Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks. Despite recent intensive studies of gradient compression for data-parallel-style training, compressing the activations for models trained with pipeline parallelism is still an open problem. In this paper, we propose AC-SGD, a novel activation compression algorithm for communication-efficient pipeline-parallel training over slow networks. Different from previous efforts in activation compression, instead of compressing activation values directly, AC-SGD compresses the changes of the activations. This allows us to show, to the best of our knowledge for the first time, that one can still achieve an $O(1/\sqrt{T})$ convergence rate for non-convex objectives under activation compression, without making assumptions on gradient unbiasedness that do not hold for deep learning models with non-linear activation functions. We then show that AC-SGD can be optimized and implemented efficiently, without additional end-to-end runtime overhead. We evaluated AC-SGD by fine-tuning language models with up to 1.5 billion parameters, compressing activations to 2-4 bits. AC-SGD provides up to 4.3X end-to-end speed-up in slower networks, without sacrificing model quality. Moreover, we also show that AC-SGD can be combined with state-of-the-art gradient compression algorithms to enable "end-to-end communication compression": all communications between machines, including model gradients, forward activations, and backward gradients, are compressed into lower precision. This provides up to 4.9X end-to-end speed-up, without sacrificing model quality.
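To make the abstract's key idea concrete, below is a minimal sketch of compressing the change of activations between pipeline stages rather than the raw activation values. It is not the authors' implementation: the helper names (`quantize_uniform`, `ActivationDeltaCompressor`) and the simple per-tensor uniform quantizer are assumptions for illustration only.

```python
# Minimal sketch (assumed names, not the authors' code) of activation-change
# compression between two pipeline stages: keep a reference activation per
# micro-batch, quantize only the delta to 2-4 bits, and send the small payload.
import torch


def quantize_uniform(x: torch.Tensor, bits: int = 2):
    """Uniformly quantize x to `bits` bits; return integer codes plus (scale, offset)."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / levels
    codes = torch.round((x - lo) / scale).to(torch.uint8)  # low-precision payload
    return codes, scale, lo


def dequantize_uniform(codes, scale, lo):
    return codes.float() * scale + lo


class ActivationDeltaCompressor:
    """Keeps a reference activation per micro-batch and compresses the deltas."""

    def __init__(self, bits: int = 2):
        self.bits = bits
        self.reference = {}  # micro-batch id -> last reconstructed activation

    def compress(self, microbatch_id: int, activation: torch.Tensor):
        ref = self.reference.get(microbatch_id, torch.zeros_like(activation))
        codes, scale, lo = quantize_uniform(activation - ref, self.bits)
        # Update the reference with the *reconstructed* activation so that the
        # sender stays in sync with what the receiver will reconstruct.
        self.reference[microbatch_id] = ref + dequantize_uniform(codes, scale, lo)
        return codes, scale, lo  # this small message crosses the slow link

    def decompress(self, microbatch_id: int, codes, scale, lo):
        ref = self.reference.get(microbatch_id)
        if ref is None:
            ref = torch.zeros(codes.shape)
        recon = ref + dequantize_uniform(codes, scale, lo)
        self.reference[microbatch_id] = recon
        return recon
```

In this sketch, both sides advance their reference using the reconstructed (not the exact) activation, so sender and receiver remain synchronized despite quantization error; the low-bit message then encodes only how the activation changed since the last step, which is the property the abstract attributes to AC-SGD.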
