Paper Title
Analysis of Quantization on MLP-based Vision Models
Paper Authors
Paper Abstract
Quantization is widely adopted as a model compression technique that obtains efficient models by converting the floating-point weights and activations of a neural network into lower-bit integers. Quantization has been shown to work well on convolutional neural networks and transformer-based models. Despite the strong performance of those models, recent works have shown that MLP-based models can achieve comparable results on tasks ranging from computer vision and NLP to 3D point clouds, while delivering higher throughput thanks to their parallelism and architectural simplicity. However, as we show in this paper, directly applying quantization to MLP-based models leads to significant accuracy degradation. Based on our analysis, two major issues account for the accuracy gap: 1) the range of activations in MLP-based models can be too large to quantize, and 2) specific components of MLP-based models are sensitive to quantization. Consequently, we propose to 1) apply LayerNorm to control the quantization range of activations, 2) use bounded activation functions, 3) apply percentile quantization to activations, 4) use our improved module, named multiple token-mixing MLPs, and 5) apply a linear asymmetric quantizer to sensitive operations. Equipped with the above techniques, our Q-MLP models achieve 79.68% accuracy on ImageNet with 8-bit uniform quantization (model size 30 MB) and 78.47% with 4-bit quantization (15 MB).
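For concreteness, the sketch below illustrates two of the generic building blocks named in the abstract: percentile clipping of activation ranges and linear asymmetric uniform quantization. It is not the authors' implementation; the function names, the percentile thresholds, and the NumPy realization are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of percentile clipping
# and linear asymmetric uniform quantization for activations.
import numpy as np

def percentile_range(x, lower_pct=0.1, upper_pct=99.9):
    """Use percentiles instead of min/max so rare outlier activations
    do not stretch the quantization range (percentile quantization)."""
    return np.percentile(x, lower_pct), np.percentile(x, upper_pct)

def asymmetric_quantize(x, num_bits=8, x_min=None, x_max=None):
    """Linear asymmetric quantization: map [x_min, x_max] onto the
    integer grid [0, 2^b - 1] via a scale and a zero-point."""
    if x_min is None or x_max is None:
        x_min, x_max = x.min(), x.max()
    qmax = 2 ** num_bits - 1
    scale = (x_max - x_min) / qmax
    zero_point = np.round(-x_min / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    # Dequantize so the quantization error can be inspected in float.
    return (q - zero_point) * scale

# Example: 8-bit quantization of simulated activations with percentile clipping.
acts = np.random.randn(1024) * 3.0
lo, hi = percentile_range(acts)
acts_q = asymmetric_quantize(np.clip(acts, lo, hi), num_bits=8, x_min=lo, x_max=hi)
```

In this sketch, clipping to a percentile range shrinks the scale so that the bulk of the activation distribution gets finer resolution, while the asymmetric zero-point lets the grid cover ranges that are not centered at zero.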