Paper Title

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

Paper Authors

Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, Song Han

Paper Abstract

Transformers are ubiquitous in Natural Language Processing (NLP) tasks, but they are difficult to deploy on hardware due to their intensive computation. To enable low-latency inference on resource-constrained hardware platforms, we propose to design Hardware-Aware Transformers (HAT) with neural architecture search. We first construct a large design space with $\textit{arbitrary encoder-decoder attention}$ and $\textit{heterogeneous layers}$. Then we train a $\textit{SuperTransformer}$ that covers all candidates in the design space and efficiently produces many $\textit{SubTransformers}$ with weight sharing. Finally, we perform an evolutionary search with a hardware latency constraint to find a specialized $\textit{SubTransformer}$ dedicated to running fast on the target hardware. Extensive experiments on four machine translation tasks demonstrate that HAT can discover efficient models for different hardware (CPU, GPU, IoT device). When running the WMT'14 translation task on Raspberry Pi-4, HAT achieves a $\textbf{3}\times$ speedup and $\textbf{3.7}\times$ smaller size over the baseline Transformer, and a $\textbf{2.7}\times$ speedup and $\textbf{3.6}\times$ smaller size over the Evolved Transformer with $\textbf{12,041}\times$ less search cost and no performance loss. The HAT code is available at https://github.com/mit-han-lab/hardware-aware-transformers.git
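To make the search procedure in the abstract more concrete, below is a minimal Python sketch of a latency-constrained evolutionary search over SubTransformer configurations. The design-space choices, the search hyper-parameters, and the `predict_latency` / `evaluate_loss` callables are illustrative assumptions, not the paper's exact implementation: in HAT, latency comes from a predictor fit to measurements on the target hardware, and loss is evaluated with weights inherited from the trained SuperTransformer.

```python
import random

# Illustrative design space (dimension names and choices are assumptions for this sketch).
DESIGN_SPACE = {
    "encoder_layers": [6],
    "decoder_layers": [1, 2, 3, 4, 5, 6],
    "embed_dim": [512, 640],
    "ffn_dim": [1024, 2048, 3072],
    "attention_heads": [4, 8],
    "enc_dec_attention_span": [1, 2, 3],  # how many encoder layers each decoder layer attends to
}

def sample_subtransformer():
    """Randomly sample one SubTransformer configuration from the design space."""
    return {k: random.choice(v) for k, v in DESIGN_SPACE.items()}

def mutate(config, prob=0.3):
    """Resample each dimension independently with probability `prob`."""
    return {k: (random.choice(DESIGN_SPACE[k]) if random.random() < prob else v)
            for k, v in config.items()}

def crossover(parent_a, parent_b):
    """Take each dimension from one of the two parents at random."""
    return {k: random.choice([parent_a[k], parent_b[k]]) for k in parent_a}

def evolutionary_search(predict_latency, evaluate_loss, latency_budget_ms,
                        iterations=30, population=125, parents=25,
                        n_mutate=50, n_crossover=50):
    """Latency-constrained evolutionary search over SubTransformer configs (sketch)."""
    def sample_valid():
        # Keep sampling until the candidate fits the hardware latency budget.
        while True:
            cfg = sample_subtransformer()
            if predict_latency(cfg) <= latency_budget_ms:
                return cfg

    pop = [sample_valid() for _ in range(population)]
    for _ in range(iterations):
        scored = sorted(pop, key=evaluate_loss)       # lower validation loss is better
        top = scored[:parents]                        # keep the best configs as parents
        children = []
        while len(children) < n_mutate:               # mutation offspring under the budget
            child = mutate(random.choice(top))
            if predict_latency(child) <= latency_budget_ms:
                children.append(child)
        while len(children) < n_mutate + n_crossover: # crossover offspring under the budget
            child = crossover(*random.sample(top, 2))
            if predict_latency(child) <= latency_budget_ms:
                children.append(child)
        pop = top + children
    return min(pop, key=evaluate_loss)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real run would use a trained
    # latency predictor and validation loss of SuperTransformer-inherited weights.
    toy_latency = lambda c: 5.0 * c["decoder_layers"] + 0.01 * c["ffn_dim"]
    toy_loss = lambda c: 1.0 / (c["embed_dim"] * c["ffn_dim"] * c["decoder_layers"])
    best = evolutionary_search(toy_latency, toy_loss, latency_budget_ms=60.0,
                               iterations=5, population=20, parents=5,
                               n_mutate=8, n_crossover=7)
    print("best config under latency budget:", best)
```

The part this sketch aims to convey is the overall structure of the search: constrained sampling against a hardware latency budget, selection of top-scoring parents, and mutation plus crossover to propose new candidates. Swapping in the real latency predictor and SuperTransformer-based evaluation is what specializes the result to a given hardware platform.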
