Paper Title
Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs
Paper Authors
Paper Abstract
Convolutional Neural Networks (CNN) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, many times just by porting large models designed for server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations is channel pruning. Mobile and embedded systems now have GPUs which are ideal for the parallel computations of neural networks and for their lower energy cost per operation. Specialized libraries perform these neural network computations through highly optimized routines. As we find in our experiments, these libraries are optimized for the most common network shapes, making uninstructed channel pruning inefficient. We evaluate higher level libraries, which analyze the input characteristics of a convolutional layer, based on which they produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. However, in reality, these characteristics and subsequent choices intended for optimization can have the opposite effect. We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to 2x slowdown. On the other hand, we also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3x with cuDNN and above 10x with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning.
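To make the notion of channel pruning referenced in the abstract concrete, below is a minimal sketch of removing whole output channels from a convolutional layer's weight tensor. The L1-norm selection criterion, the function name, and the 88% keep ratio are illustrative assumptions for this sketch only; they are not the selection policy or tooling evaluated in the paper.

```python
import numpy as np

def prune_output_channels(weights, keep_ratio):
    """Illustrative channel pruning: keep the fraction `keep_ratio` of output
    channels whose filters have the largest L1 norm.

    `weights` has shape (out_channels, in_channels, kH, kW).
    Returns the pruned weight tensor and the indices of kept channels.
    """
    n_out = weights.shape[0]
    n_keep = max(1, int(round(n_out * keep_ratio)))
    # Score each output-channel filter by the L1 norm of its weights.
    scores = np.abs(weights).reshape(n_out, -1).sum(axis=1)
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return weights[keep_idx], keep_idx

# Example: a 3x3 convolution with 64 output and 32 input channels.
w = np.random.randn(64, 32, 3, 3).astype(np.float32)
pruned_w, kept = prune_output_channels(w, keep_ratio=0.88)  # drop roughly 12% of channels
print(w.shape, "->", pruned_w.shape)
```

As the abstract notes, shrinking the channel count this way changes the layer shapes that optimized library routines (Arm Compute Library, TVM, cuDNN) see, which is why an uninstructed reduction can end up slower than the original layer.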