Paper Title

ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference

Paper Authors

Jing Gong, Hassaan Saadat, Hasindu Gamaarachchi, Haris Javaid, Xiaobo Sharon Hu, Sri Parameswaran

Paper Abstract

Edge training of Deep Neural Networks (DNNs) is a desirable goal for continuous learning; however, it is hindered by the enormous computational power required by training. Hardware approximate multipliers have shown their effectiveness for gaining resource efficiency in DNN inference accelerators; however, training with approximate multipliers is largely unexplored. To build resource-efficient accelerators with approximate multipliers supporting DNN training, a thorough evaluation of training convergence and accuracy for different DNN architectures and different approximate multipliers is needed. This paper presents ApproxTrain, an open-source framework that allows fast evaluation of DNN training and inference using simulated approximate multipliers. ApproxTrain is as user-friendly as TensorFlow (TF) and requires only a high-level description of a DNN architecture along with C/C++ functional models of the approximate multiplier. We improve the speed of the simulation at the multiplier level by using a novel LUT-based approximate floating-point (FP) multiplier simulator on GPU (AMSim). ApproxTrain leverages CUDA and efficiently integrates AMSim into the TensorFlow library in order to overcome the absence of native hardware approximate multipliers in commercial GPUs. We use ApproxTrain to evaluate the convergence and accuracy of DNN training with approximate multipliers for small and large datasets (including ImageNet) using LeNet and ResNet architectures. The evaluations demonstrate similar convergence behavior and negligible change in test accuracy compared to FP32 and bfloat16 multipliers. Compared to CPU-based approximate multiplier simulations in training and inference, the GPU-accelerated ApproxTrain is more than 2500x faster. The original TensorFlow, which relies on highly optimized closed-source cuDNN/cuBLAS libraries with native hardware multipliers, is only 8x faster than ApproxTrain.
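
The abstract states that ApproxTrain takes a C/C++ functional model of the approximate multiplier from the user, and that AMSim speeds up simulation by turning per-multiplication emulation into LUT lookups on the GPU. The sketch below illustrates both ideas under stated assumptions: the mantissa-truncating multiplier, the constant kMantBits, and every function name are hypothetical examples rather than ApproxTrain's actual interface or the multipliers evaluated in the paper, and in the real framework the lookup would run inside CUDA kernels integrated with the TensorFlow ops rather than in host C++.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// ---- A hypothetical functional model (what a user might supply) ----
// It truncates each operand's 23-bit FP32 mantissa to its top kMantBits
// bits and then multiplies exactly, emulating a reduced-precision
// mantissa multiplier. This is an illustrative stand-in, not one of the
// multipliers studied in the paper.
constexpr int kMantBits = 7;               // illustrative precision choice
constexpr int kEntries  = 1 << kMantBits;

static float truncate_mantissa(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= ~((1u << (23 - kMantBits)) - 1u);   // clear the low mantissa bits
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

static float approx_mul(float a, float b) {     // the C/C++ functional model
    return truncate_mantissa(a) * truncate_mantissa(b);
}

// ---- The LUT idea behind fast simulation ----
// Precompute the functional model once for every pair of truncated
// significands in [1, 2); at simulation time each multiplication becomes a
// table lookup plus sign/exponent bookkeeping. (Zeros, infinities and
// denormals are ignored in this sketch.)
static std::vector<float> build_lut() {
    std::vector<float> lut(kEntries * kEntries);
    for (int i = 0; i < kEntries; ++i)
        for (int j = 0; j < kEntries; ++j)
            lut[i * kEntries + j] =
                approx_mul(1.0f + float(i) / kEntries, 1.0f + float(j) / kEntries);
    return lut;
}

static float lut_mul(float x, float y, const std::vector<float>& lut) {
    uint32_t bx, by;
    std::memcpy(&bx, &x, 4);
    std::memcpy(&by, &y, 4);
    uint32_t sign = (bx ^ by) & 0x80000000u;        // sign of the product
    int ex = int((bx >> 23) & 0xFF) - 127;          // unbiased exponents
    int ey = int((by >> 23) & 0xFF) - 127;
    uint32_t ix = (bx >> (23 - kMantBits)) & (kEntries - 1);
    uint32_t iy = (by >> (23 - kMantBits)) & (kEntries - 1);
    float r = std::ldexp(lut[ix * kEntries + iy], ex + ey);
    uint32_t rb;
    std::memcpy(&rb, &r, 4);
    rb |= sign;                                     // re-attach the sign bit
    std::memcpy(&r, &rb, 4);
    return r;
}

int main() {
    auto lut = build_lut();
    float a = -1.2345678f, b = 3.1415927f;
    std::printf("exact : %.7f\n", a * b);
    std::printf("model : %.7f\n", approx_mul(a, b));
    std::printf("LUT   : %.7f\n", lut_mul(a, b, lut));
    return 0;
}
```

The point of the LUT step is that calling a bit-accurate functional model for every multiplication inside convolution and GEMM kernels is slow; precomputing its results over all pairs of truncated significands reduces each simulated multiply to a single table lookup plus sign/exponent bookkeeping, which is what makes GPU-side simulation of approximate multipliers fast.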
