Paper Title

Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding

Paper Authors

Connor Holmes, Minjia Zhang, Yuxiong He, Bo Wu

Paper Abstract

In recent years, large pre-trained Transformer networks have demonstrated dramatic improvements in many natural language understanding tasks. However, the huge size of these models brings significant challenges to their fine-tuning and online deployment due to latency and cost constraints. New hardware supporting both N:M semi-structured sparsity and low-precision integer computation is a promising solution to boost DNN model serving efficiency. However, there have been very few studies that systematically investigate to what extent pre-trained Transformer networks benefit from the combination of these techniques, or how to best compress each component of the Transformer. We propose NxMiFormer, a flexible compression framework that performs simultaneous sparsification and quantization using ADMM and STE-based quantization-aware training (QAT). Furthermore, we present an inexpensive, heuristic-driven search algorithm that identifies promising heterogeneous compression configurations meeting a compression-ratio constraint. When evaluated across the GLUE suite of NLU benchmarks, our approach achieves up to 93% compression of the encoders of a BERT model while retaining 98.2% of the original model's accuracy and taking full advantage of the hardware's capabilities. Heterogeneous configurations found by the search heuristic maintain 99.5% of the baseline accuracy while still compressing the model by 87.5%.
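
To make the two ingredients the abstract combines more concrete, below is a minimal PyTorch sketch of (a) a 2:4 magnitude-based N:M sparsity mask and (b) symmetric fake quantization with a straight-through estimator (STE). This is an illustrative assumption, not the paper's NxMiFormer implementation; the helper names `nxm_mask` and `fake_quantize` are hypothetical.

```python
import torch

def nxm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Build an N:M semi-structured sparsity mask: within every group of
    m consecutive weights along the input dimension, keep only the n
    entries with the largest magnitude (2:4 here, matching sparse GEMM
    hardware). Assumes the last dimension is divisible by m."""
    groups = weight.abs().reshape(-1, m)        # (num_groups, m)
    kept = groups.topk(n, dim=1).indices        # indices of the n survivors
    mask = torch.zeros_like(groups)
    mask.scatter_(1, kept, 1.0)
    return mask.reshape(weight.shape)

def fake_quantize(weight: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization. The STE makes the forward
    pass use the quantized values while gradients flow back to the
    full-precision weights unchanged."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max().clamp(min=1e-8) / qmax
    q = (weight / scale).round().clamp(-qmax, qmax) * scale
    return weight + (q - weight).detach()       # STE: identity gradient

# Toy usage: 2:4 sparsity plus int8 fake quantization on one linear weight.
w = torch.randn(8, 16, requires_grad=True)
w_hat = fake_quantize(w * nxm_mask(w), bits=8)
w_hat.sum().backward()                          # gradients reach w via the STE
```

In ADMM-based sparsification generally, a magnitude projection like `nxm_mask` serves as the proximal step, with dual variables nudging the dense weights toward the sparse pattern over the course of fine-tuning; the sketch above shows only that pattern-selection step combined with QAT-style fake quantization.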
