Paper Title
ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language Models via Efficient Large-Batch Adversarial Noise
Paper Authors
Paper Abstract
In recent years, large pre-trained Transformer-based language models have led to dramatic improvements in many natural language understanding tasks. To train these models with increasing sizes, many neural network practitioners attempt to increase the batch sizes in order to leverage multiple GPUs to improve training speed. However, increasing the batch size often makes the optimization more difficult, leading to slow convergence or poor generalization that can require orders of magnitude more training time to achieve the same model quality. In this paper, we explore the steepness of the loss landscape of large-batch optimization for adapting pre-trained Transformer-based language models to domain-specific tasks and find that it tends to be highly complex and irregular, posing challenges to generalization on downstream tasks. To tackle this challenge, we propose ScaLA, a novel and efficient method to accelerate the adaptation speed of pre-trained transformer networks. Different from prior methods, we take a sequential game-theoretic approach by adding lightweight adversarial noise into large-batch optimization, which significantly improves adaptation speed while preserving model generalization. Experiment results show that ScaLA attains 2.7--9.8$\times$ adaptation speedups over the baseline for GLUE on BERT-base and RoBERTa-large, while achieving comparable and sometimes higher accuracy than the state-of-the-art large-batch optimization methods. Finally, we also address the theoretical aspect of large-batch optimization with adversarial noise and provide a theoretical convergence rate analysis for ScaLA using techniques for analyzing non-convex saddle-point problems.
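For intuition, below is a minimal PyTorch-style sketch of the general idea described in the abstract: injecting lightweight adversarial noise into the input embeddings at each fine-tuning step, with an inner maximization over the noise followed by an outer parameter update (a min-max, saddle-point view). This is not the authors' implementation of ScaLA; the model interface (a HuggingFace-style model accepting `inputs_embeds` and returning `.logits`), the function name `adversarial_noise_step`, and all hyperparameters are illustrative assumptions, and the large-batch optimizer (e.g., LAMB) is abstracted behind a generic `optimizer`.

```python
import torch
import torch.nn.functional as F

def adversarial_noise_step(model, embeddings, labels, optimizer,
                           noise_eps=1e-3, noise_lr=1e-2, ascent_steps=1):
    """One parameter update with lightweight adversarial noise on the input
    embeddings. A generic min-max (saddle-point) sketch, not ScaLA itself."""
    # Inner maximization: start from small random noise and take a few
    # cheap ascent steps on the noise to increase the loss.
    delta = (noise_eps * torch.randn_like(embeddings)).requires_grad_(True)
    for _ in range(ascent_steps):
        adv_logits = model(inputs_embeds=embeddings + delta).logits
        adv_loss = F.cross_entropy(adv_logits, labels)
        grad, = torch.autograd.grad(adv_loss, delta)
        # Signed-gradient ascent step, kept inside a small L-infinity ball.
        delta = (delta + noise_lr * grad.sign()).clamp(-noise_eps, noise_eps)
        delta = delta.detach().requires_grad_(True)

    # Outer minimization: update the model parameters on the perturbed batch.
    optimizer.zero_grad()
    logits = model(inputs_embeds=embeddings + delta.detach()).logits
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice such a step would be combined with a large-batch optimizer and a learning-rate schedule appropriate for the batch size; the paper's actual algorithm and its convergence analysis are given in the full text.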