Title

Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming

Authors

Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, Daniel Soudry

Abstract

Lately, post-training quantization methods have gained considerable attention, as they are simple to use and require only a small unlabeled calibration set. This small dataset cannot be used to fine-tune the model without significant over-fitting. Instead, these methods only use the calibration set to set the activations' dynamic ranges. However, such methods have always resulted in significant accuracy degradation when used below 8 bits (except on small datasets). Here we aim to break the 8-bit barrier. To this end, we minimize the quantization errors of each layer separately by optimizing its parameters over the calibration set. We empirically demonstrate that this approach is: (1) much less susceptible to over-fitting than standard fine-tuning approaches, and can be used even on a very small calibration set; and (2) more powerful than previous methods, which only set the activations' dynamic ranges. Furthermore, we demonstrate how to optimally allocate the bit-widths for each layer, while constraining accuracy degradation or model compression, by proposing a novel integer programming formulation. Finally, we suggest model global statistics tuning to correct biases introduced during quantization. Together, these methods yield state-of-the-art results for both vision and text models. For instance, on ResNet50 we obtain less than 1% accuracy degradation with 4-bit weights and activations in all layers but the smallest two. We open-sourced our code.
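The layer-wise calibration idea from the abstract can be illustrated with a short sketch: for each layer, quantization scales for weights and activations are fit by minimizing the mean-squared error between the quantized layer's output and the full-precision output on the small calibration set. The snippet below is a minimal PyTorch illustration under simplifying assumptions (a single `nn.Linear` layer, uniform symmetric fake-quantization with a straight-through estimator, per-tensor scales); the helpers `quantize` and `calibrate_layer` are hypothetical names, and this is not the authors' released implementation, which additionally covers the integer-programming bit allocation and the bias correction described above.

```python
import torch
import torch.nn as nn

def quantize(x, scale, n_bits=4):
    # Uniform symmetric fake-quantization with a straight-through estimator:
    # the forward pass returns round(x / scale) * scale (clamped to the n-bit
    # range), while gradients flow to both x (identity) and the learnable scale.
    qmax = 2 ** (n_bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return (q - x / scale).detach() * scale + x

def calibrate_layer(fp_layer, calib_inputs, n_bits=4, steps=500, lr=1e-3):
    # Per-layer calibration: fit weight/activation scales so the quantized
    # layer reproduces the full-precision layer's outputs on the calibration
    # set (MSE objective), without any labels.
    with torch.no_grad():
        fp_out = fp_layer(calib_inputs)          # full-precision reference outputs
        w = fp_layer.weight.detach()
    qmax = 2 ** (n_bits - 1) - 1
    w_scale = (w.abs().max() / qmax).clone().requires_grad_(True)
    a_scale = (calib_inputs.abs().max() / qmax).clone().requires_grad_(True)

    opt = torch.optim.Adam([w_scale, a_scale], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        q_w = quantize(w, w_scale, n_bits)
        q_in = quantize(calib_inputs, a_scale, n_bits)
        q_out = nn.functional.linear(q_in, q_w, fp_layer.bias)
        loss = nn.functional.mse_loss(q_out, fp_out)   # layer-wise quantization error
        loss.backward()
        opt.step()
    return w_scale.detach(), a_scale.detach()

# Example: calibrate one layer on a tiny unlabeled batch.
layer = nn.Linear(512, 512)
calib = torch.randn(256, 512)
w_s, a_s = calibrate_layer(layer, calib, n_bits=4)
```

Because each layer is optimized independently against its own full-precision outputs, only a handful of scale parameters are fit per layer, which is why this kind of calibration over-fits far less than end-to-end fine-tuning on the same small set.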
