Paper Title
Hardware-Centric AutoML for Mixed-Precision Quantization
Paper Authors
Paper Abstract
Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emerging DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve computation efficiency, which raises a great challenge: finding the optimal bitwidth for each layer requires domain experts to explore a vast design space, trading off accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. Conventional quantization algorithms ignore the underlying hardware architecture and quantize all layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy and takes the hardware accelerator's feedback into the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals (latency and energy) for the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network and hardware architectures. Compared with fixed-bitwidth (8-bit) quantization, our framework effectively reduces latency by 1.4-1.95x and energy consumption by 1.9x with negligible loss of accuracy. Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, energy, and model size) are drastically different. We interpret the implications of the different quantization policies, which offer insights for both neural network architecture design and hardware architecture design.
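
As a rough illustration of the search problem the abstract describes (choosing a per-layer bitwidth under a direct hardware-cost constraint), the minimal Python sketch below allocates bitwidths under a latency budget. It is hypothetical throughout: the layer workloads, the latency model, and the accuracy proxy are invented for illustration, and plain random search stands in for the paper's reinforcement-learning agent and hardware simulator.

    import random

    # Hypothetical per-layer workload for a small CNN: (name, MACs in millions, params in K).
    # These numbers are illustrative only and are not taken from the paper.
    LAYERS = [
        ("conv1", 11.0, 0.9),
        ("conv2", 57.0, 14.7),
        ("conv3", 38.0, 29.5),
        ("fc",     4.1, 512.0),
    ]

    BITWIDTHS = [2, 3, 4, 5, 6, 7, 8]

    def toy_latency_ms(policy):
        """Toy stand-in for the hardware simulator: latency of a layer is assumed to
        scale with its MAC count and roughly linearly with bitwidth relative to 8 bits."""
        return sum(macs * (bits / 8.0) * 0.01
                   for (_, macs, _), bits in zip(LAYERS, policy))

    def proxy_accuracy(policy):
        """Toy reward: lower bitwidths cost accuracy, and layers with more parameters are
        assumed to be more sensitive. A real system would fine-tune and evaluate the model."""
        penalty = sum((8 - bits) * 0.002 * (1.0 + params / 512.0)
                      for (_, _, params), bits in zip(LAYERS, policy))
        return 0.75 - penalty  # 0.75 is a made-up full-precision baseline accuracy

    def search(latency_budget_ms, iters=5000, seed=0):
        """Random search over per-layer bitwidths (an RL agent plays this role in the paper)."""
        rng = random.Random(seed)
        best = None
        for _ in range(iters):
            policy = [rng.choice(BITWIDTHS) for _ in LAYERS]
            if toy_latency_ms(policy) > latency_budget_ms:
                continue  # the direct hardware feedback acts as a hard constraint here
            acc = proxy_accuracy(policy)
            if best is None or acc > best[0]:
                best = (acc, policy)
        return best

    if __name__ == "__main__":
        acc, policy = search(latency_budget_ms=0.8)
        for (name, _, _), bits in zip(LAYERS, policy):
            print(f"{name}: {bits} bits")
        print(f"proxy accuracy {acc:.3f}, latency {toy_latency_ms(policy):.2f} ms")

The point of the sketch is only the shape of the loop: propose a mixed-precision policy, query a hardware cost model directly for latency (rather than a proxy such as FLOPs), reject or penalize policies that violate the budget, and keep the policy with the best accuracy signal.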