Paper Title

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Paper Authors

Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez

Paper Abstract

Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
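As a rough illustration of the "train large, then compress" recipe described in the abstract, the sketch below applies magnitude pruning followed by dynamic int8 quantization to a Transformer using standard PyTorch utilities. The model dimensions, the 60% sparsity level, and the quantization settings are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of the "train large, then compress" recipe, assuming a
# PyTorch Transformer. The layer sizes, 60% sparsity, and int8 dynamic
# quantization below are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a large (pre)trained Transformer encoder.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, dim_feedforward=4096),
    num_layers=24,
)

# 1) Pruning: zero out the lowest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)
        prune.remove(module, "weight")  # make the sparsity permanent

# 2) Quantization: convert Linear weights to int8 for cheaper inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized)  # a heavily compressed large model, ready for inference
```

In the paper's framing, this compressed large model is what ends up beating a lightly compressed small model of comparable inference cost.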
