Paper Title

Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning

Paper Authors

Guyue Huang, Haoran Li, Minghai Qin, Fei Sun, Yufei Ding, Yuan Xie

Paper Abstract

Weight pruning in deep neural networks (DNNs) can reduce storage and computation costs, but it struggles to deliver practical speedup in model inference time. Tensor-cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor-cores for sparse DNNs is very challenging. Compared to existing CUDA-cores, tensor-cores require higher data reuse and matrix-shaped instruction granularity, both of which are difficult to obtain from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves model quality well but prohibits tensor-core acceleration, while highly structured block-wise sparsity can exploit tensor-cores but suffers from severe accuracy loss. In this work, we propose a novel sparse pattern, Shuffled Block-wise sparsity (Shfl-BW), designed to efficiently utilize tensor-cores while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure while introducing negligible overhead with our GPU kernel designs. We optimize the GPU kernels for Shfl-BW in linear and convolution layers. Evaluations show that our techniques achieve state-of-the-art speed-accuracy trade-offs on GPUs. For example, with small accuracy loss, we can accelerate the computation-intensive layers of Transformer by 1.81x, 4.18x, and 1.90x on NVIDIA V100, T4, and A100 GPUs, respectively, at 75% sparsity.
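To make the idea of shuffled block-wise sparsity more concrete, below is a minimal NumPy sketch of the pruning side only: a column permutation followed by block-wise magnitude pruning. The function names (block_prune, shfl_bw_prune), the 32x32 tile shape, the L1 tile-scoring rule, and the use of a random permutation (standing in for a permutation actually chosen to preserve accuracy) are illustrative assumptions, not details taken from the paper; the paper's row-wise permutation and its optimized GPU kernels are not shown here.

```python
import numpy as np

def block_prune(weight, tile=(32, 32), sparsity=0.75):
    """Block-wise magnitude pruning: zero out whole tiles with the lowest L1 norm.

    The tile shape and the L1 scoring rule are illustrative assumptions,
    not details taken from the Shfl-BW paper.
    """
    R, C = weight.shape
    tr, tc = tile
    assert R % tr == 0 and C % tc == 0, "weight must be divisible into tiles"
    pruned = weight.copy()
    # Score every (tr x tc) tile by the sum of absolute values of its entries.
    scores = np.abs(pruned).reshape(R // tr, tr, C // tc, tc).sum(axis=(1, 3))
    n_drop = int(scores.size * sparsity)
    # Zero out the n_drop lowest-scoring tiles.
    for idx in np.argsort(scores, axis=None)[:n_drop]:
        br, bc = divmod(int(idx), C // tc)
        pruned[br * tr:(br + 1) * tr, bc * tc:(bc + 1) * tc] = 0.0
    return pruned

def shfl_bw_prune(weight, tile=(32, 32), sparsity=0.75, seed=0):
    """Sketch of shuffled block-wise sparsity: permute columns, then block-prune.

    A random permutation stands in for one searched to preserve accuracy.
    Since (W @ P) @ (P.T @ x) == W @ x for a permutation matrix P, applying
    the matching permutation to the activations keeps the layer output unchanged.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(weight.shape[1])
    return block_prune(weight[:, perm], tile, sparsity), perm

if __name__ == "__main__":
    w = np.random.randn(128, 256).astype(np.float32)
    w_pruned, perm = shfl_bw_prune(w, tile=(32, 32), sparsity=0.75)
    print("fraction of zero weights:", np.mean(w_pruned == 0.0))  # ~0.75
```

Because the same column permutation can be applied to the input activations (or folded into the preceding layer), the permuted, block-pruned weight computes the same layer output while exposing dense tiles that map naturally onto tensor-core matrix instructions.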
