Paper Title

Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning

Paper Authors

Guyue Huang, Haoran Li, Minghai Qin, Fei Sun, Yufei Ding, Yuan Xie

Paper Abstract

Weight pruning in deep neural networks (DNNs) can reduce storage and computation costs, but it struggles to deliver practical speedup in model inference time. Tensor-cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor-cores for sparse DNNs is very challenging. Compared to existing CUDA-cores, tensor-cores require higher data reuse and matrix-shaped instruction granularity, both of which are difficult to obtain from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves model quality well but prohibits tensor-core acceleration, while highly structured block-wise sparsity can exploit tensor-cores but suffers from severe accuracy loss. In this work, we propose a novel sparse pattern, Shuffled Block-wise sparsity (Shfl-BW), designed to efficiently utilize tensor-cores while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure while introducing negligible overhead with our GPU kernel designs. We optimize the GPU kernels for Shfl-BW in linear and convolution layers. Evaluations show that our techniques achieve state-of-the-art speed-accuracy trade-offs on GPUs. For example, with small accuracy loss, we can accelerate the computation-intensive layers of Transformer by 1.81x, 4.18x, and 1.90x on NVIDIA V100, T4, and A100 GPUs, respectively, at 75% sparsity.
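To make the idea of shuffled block-wise sparsity more concrete, below is a minimal NumPy sketch of the pruning side only: a column permutation followed by block-wise magnitude pruning. The function names (block_prune, shfl_bw_prune), the 32x32 tile shape, the L1 tile-scoring rule, and the use of a random permutation (standing in for a permutation actually chosen to preserve accuracy) are illustrative assumptions, not details taken from the paper; the paper's row-wise permutation and its optimized GPU kernels are not shown here.

```python
import numpy as np

def block_prune(weight, tile=(32, 32), sparsity=0.75):
    """Block-wise magnitude pruning: zero out whole tiles with the lowest L1 norm.

    The tile shape and the L1 scoring rule are illustrative assumptions,
    not details taken from the Shfl-BW paper.
    """
    R, C = weight.shape
    tr, tc = tile
    assert R % tr == 0 and C % tc == 0, "weight must be divisible into tiles"
    pruned = weight.copy()
    # Score every (tr x tc) tile by the sum of absolute values of its entries.
    scores = np.abs(pruned).reshape(R // tr, tr, C // tc, tc).sum(axis=(1, 3))
    n_drop = int(scores.size * sparsity)
    # Zero out the n_drop lowest-scoring tiles.
    for idx in np.argsort(scores, axis=None)[:n_drop]:
        br, bc = divmod(int(idx), C // tc)
        pruned[br * tr:(br + 1) * tr, bc * tc:(bc + 1) * tc] = 0.0
    return pruned

def shfl_bw_prune(weight, tile=(32, 32), sparsity=0.75, seed=0):
    """Sketch of shuffled block-wise sparsity: permute columns, then block-prune.

    A random permutation stands in for one searched to preserve accuracy.
    Since (W @ P) @ (P.T @ x) == W @ x for a permutation matrix P, applying
    the matching permutation to the activations keeps the layer output unchanged.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(weight.shape[1])
    return block_prune(weight[:, perm], tile, sparsity), perm

if __name__ == "__main__":
    w = np.random.randn(128, 256).astype(np.float32)
    w_pruned, perm = shfl_bw_prune(w, tile=(32, 32), sparsity=0.75)
    print("fraction of zero weights:", np.mean(w_pruned == 0.0))  # ~0.75
```

Because the same column permutation can be applied to the input activations (or folded into the preceding layer), the permuted, block-pruned weight computes the same layer output while exposing dense tiles that map naturally onto tensor-core matrix instructions.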
