Paper Title
Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity
Paper Authors
Paper Abstract
Network pruning can reduce the high computation cost of deep neural network (DNN) models. However, to maintain their accuracy, sparse models often carry randomly distributed weights, leading to irregular computation. Consequently, sparse models cannot achieve meaningful speedup on commodity hardware (e.g., GPUs) built for dense matrix computation. As such, prior works usually modify existing architectures or design completely new sparsity-optimized ones to exploit sparsity. We propose an algorithm-software co-designed pruning method that achieves latency speedups on existing dense architectures. Our work builds on the insight that matrix multiplication generally breaks a large matrix into multiple smaller tiles for parallel execution. We propose a tiling-friendly "tile-wise" sparsity pattern, which maintains a regular pattern at the tile level for efficient execution but allows irregular, arbitrary pruning at the global scale to maintain high accuracy. We implement and evaluate the sparsity pattern on GPU tensor cores, achieving a 1.95x speedup over the dense model.
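To make the tile-wise idea concrete, below is a minimal NumPy sketch of one possible variant, not the authors' exact algorithm: the weight matrix is split into column tiles, and within each tile whole columns are pruned by L1 magnitude. Each surviving tile stays dense and regular (so it could be fed to an ordinary dense GEMM kernel), while different tiles may keep different columns, giving the globally irregular pattern the abstract describes. The tile width, keep ratio, and scoring rule here are illustrative assumptions.

```python
import numpy as np

def tile_wise_prune(W, tile_cols=64, keep_ratio=0.5):
    """Illustrative tile-wise pruning sketch (assumed variant, not the
    paper's exact method): split W into column tiles and, within each
    tile, zero out the columns with the smallest L1 norms."""
    M, N = W.shape
    pruned = W.copy()
    for start in range(0, N, tile_cols):
        tile = pruned[:, start:start + tile_cols]  # view into `pruned`
        # Rank the columns of this tile by their L1 magnitude.
        scores = np.abs(tile).sum(axis=0)
        n_keep = max(1, int(keep_ratio * tile.shape[1]))
        # Drop the lowest-scoring columns; the kept columns remain dense,
        # so each tile keeps a regular structure.
        drop = np.argsort(scores)[: tile.shape[1] - n_keep]
        tile[:, drop] = 0.0
    return pruned

# Toy usage: prune a random weight matrix and inspect the resulting sparsity.
W = np.random.randn(128, 256).astype(np.float32)
W_sparse = tile_wise_prune(W, tile_cols=64, keep_ratio=0.5)
print("overall sparsity:", float(np.mean(W_sparse == 0.0)))
```

Because the zeroed columns are aligned to tile boundaries, a tiled GEMM can skip or compact them per tile without any hardware support for unstructured sparsity, which is the property the pattern is designed to exploit.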