Paper Title
AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models
Paper Authors
Paper Abstract
Transformers have greatly advanced the state-of-the-art in Natural Language Processing (NLP) in recent years, but impose very large computation and storage requirements. We observe that the design process of Transformers (pre-train a foundation model on a large dataset in a self-supervised manner, and subsequently fine-tune it for different downstream tasks) leads to task-specific models that are highly over-parameterized, adversely impacting both accuracy and inference efficiency. We propose AxFormer, a systematic framework that applies accuracy-driven approximations to create optimized Transformer models for a given downstream task. AxFormer combines two key optimizations: accuracy-driven pruning and selective hard attention. Accuracy-driven pruning identifies and removes parts of the fine-tuned Transformer that hinder performance on the given downstream task. Selective hard attention optimizes attention blocks in selected layers by eliminating irrelevant word aggregations, thereby helping the model focus only on the relevant parts of the input. In effect, AxFormer produces models that are more accurate, while also being faster and smaller. Our experiments on GLUE and SQuAD tasks show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned models. In addition, we demonstrate that AxFormer can be combined with previous efforts such as distillation or quantization to achieve further efficiency gains.
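For intuition, the sketch below shows one common way the hard-attention idea described in the abstract could be realized: masking all but the top-k attention logits per query before the softmax, so each token aggregates information from only a few relevant tokens. The function name `topk_hard_attention`, the use of top-k as the selection rule, and the value of k are illustrative assumptions for this sketch, not the mechanism actually used by AxFormer.

```python
import torch

def topk_hard_attention(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Illustrative top-k 'hard' attention (assumed mechanism, not AxFormer's).

    Keeps only the k largest attention logits per query position and suppresses
    the rest before the softmax, so each token attends to at most k other tokens.

    scores: raw attention logits of shape [batch, heads, query_len, key_len].
    """
    # k-th largest logit for every query position, shape [..., 1]
    kth_value = scores.topk(k, dim=-1).values[..., -1:]
    # Mask out every logit below the k-th largest one
    masked = scores.masked_fill(scores < kth_value, float("-inf"))
    # Softmax then assigns exactly zero weight to the masked keys
    return torch.softmax(masked, dim=-1)

# Example: 1 sentence, 8 heads, 16 tokens; each token attends to at most 4 others
probs = topk_hard_attention(torch.randn(1, 8, 16, 16), k=4)
assert probs.shape == (1, 8, 16, 16)
```

A hard top-k rule like this makes the surviving attention weights exactly zero for discarded keys, which is what allows irrelevant word aggregations to be skipped entirely rather than merely down-weighted; the paper itself should be consulted for how AxFormer selects the layers and positions where hard attention is applied.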