Paper Title
Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs
Paper Authors
Paper Abstract
Large-scale convolutional neural networks (CNNs) suffer from very long training times, spanning from hours to weeks, limiting the productivity and experimentation of deep learning practitioners. As networks grow in size and complexity, training time can be reduced through low-precision data representations and computations. However, in doing so the final accuracy suffers due to the problem of vanishing gradients. Existing state-of-the-art methods combat this issue by means of a mixed-precision approach utilising two different precision levels, FP32 (32-bit floating-point) and FP16/FP8 (16-/8-bit floating-point), leveraging the hardware support of recent GPU architectures for FP16 operations to obtain performance gains. This work pushes the boundary of quantised training by employing a multilevel optimisation approach that utilises multiple precisions including low-precision fixed-point representations. The novel training strategy, MuPPET, combines the use of multiple number representation regimes together with a precision-switching mechanism that decides at run time the transition point between precision regimes. Overall, the proposed strategy tailors the training process to the hardware-level capabilities of the target hardware architecture and yields improvements in training time and energy efficiency compared to state-of-the-art approaches. Applying MuPPET on the training of AlexNet, ResNet18 and GoogLeNet on ImageNet (ILSVRC12) and targeting an NVIDIA Turing GPU, MuPPET achieves the same accuracy as standard full-precision training with training-time speedup of up to 1.84$\times$ and an average speedup of 1.58$\times$ across the networks.
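To make the core idea of the abstract concrete, the snippet below is a minimal, illustrative sketch of quantised training with a run-time precision-switching policy. It is not the authors' implementation: the `quantise_fixed_point` helper, the `PrecisionSwitcher` class, the (8, 12, 16, 32)-bit regime schedule and the loss-plateau switching heuristic are all illustrative assumptions; MuPPET's actual fixed-point formats and switching metric are defined in the paper itself.

```python
import torch
import torch.nn as nn


def quantise_fixed_point(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Simulate symmetric fixed-point quantisation at the given bit-width (assumed scheme)."""
    if bits >= 32:                      # treat 32 bits as full FP32 precision
        return x
    scale = x.abs().max().clamp(min=1e-8)
    levels = 2 ** (bits - 1) - 1
    return torch.round(x / scale * levels) / levels * scale


class PrecisionSwitcher:
    """Toy run-time policy: advance to the next (higher-precision) regime when the
    training loss stops improving for `patience` epochs. The real MuPPET policy uses
    a different metric; this heuristic only illustrates the switching mechanism."""

    def __init__(self, regimes=(8, 12, 16, 32), patience=2):
        self.regimes = list(regimes)
        self.patience = patience
        self.idx = 0
        self.best_loss = float("inf")
        self.stale_epochs = 0

    @property
    def bits(self) -> int:
        return self.regimes[self.idx]

    def update(self, epoch_loss: float) -> int:
        if epoch_loss < self.best_loss:
            self.best_loss = epoch_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        if self.stale_epochs >= self.patience and self.idx < len(self.regimes) - 1:
            self.idx += 1               # switch to the next precision regime
            self.stale_epochs = 0
            self.best_loss = float("inf")
        return self.bits


def train(model: nn.Module, loader, epochs: int = 10, lr: float = 0.01):
    loss_fn = nn.CrossEntropyLoss()
    switcher = PrecisionSwitcher()
    master = [p.detach().clone() for p in model.parameters()]   # FP32 master weights

    for epoch in range(epochs):
        running = 0.0
        for x, y in loader:
            # Forward/backward pass with weights quantised to the current regime
            with torch.no_grad():
                for p, m in zip(model.parameters(), master):
                    p.copy_(quantise_fixed_point(m, switcher.bits))
            model.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            # Apply the SGD step to the full-precision master copy
            with torch.no_grad():
                for p, m in zip(model.parameters(), master):
                    m -= lr * p.grad
            running += loss.item()
        bits = switcher.update(running / len(loader))
        print(f"epoch {epoch}: avg loss {running / len(loader):.4f}, regime {bits}-bit")
```

The sketch keeps an FP32 master copy of the weights and quantises them before each forward/backward pass, so low-precision regimes accelerate the bulk of the computation while the switching policy restores precision as training progresses, mirroring the multilevel strategy described in the abstract.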