Paper Title

Heuristic Adaptability to Input Dynamics for SpMM on GPUs

Paper Authors

Guohao Dai, Guyue Huang, Shang Yang, Zhongming Yu, Hengrui Zhang, Yufei Ding, Yuan Xie, Huazhong Yang, Yu Wang

Paper Abstract

Sparse Matrix-Matrix Multiplication (SpMM) has served as a fundamental component in various domains. Many previous studies exploit GPUs for SpMM acceleration because GPUs provide high bandwidth and parallelism. We point out that a static design does not always improve the performance of SpMM on different input data (e.g., >85% performance loss with a single algorithm). In this paper, we consider the challenge of input dynamics from a novel auto-tuning perspective, while the following issues remain to be solved: (1) Orthogonal design principles considering sparsity. Orthogonal design principles for such a sparse problem should be extracted to form different algorithms and further used for performance tuning. (2) Nontrivial implementations in the algorithm space. Combining orthogonal design principles to create new algorithms requires tackling new challenges such as thread race handling. (3) Heuristic adaptability to input dynamics. Heuristic adaptability is required to dynamically optimize code for input dynamics. To tackle these challenges, we first propose a novel three-loop model to extract orthogonal design principles for SpMM on GPUs. The model not only covers previous SpMM designs, but also comes up with new designs absent from previous studies. We propose techniques such as conditional reduction to implement algorithms missing in previous studies. We further propose DA-SpMM, a Data-Aware heuristic GPU kernel for SpMM. DA-SpMM adaptively optimizes code considering input dynamics. Extensive experimental results show that DA-SpMM achieves 1.26x~1.37x speedup on average compared with the best NVIDIA cuSPARSE algorithm, and brings up to 5.59x end-to-end speedup to applications such as Graph Neural Networks.
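To make the "three loops" of SpMM concrete, below is a minimal CUDA sketch of a plain CSR-based SpMM kernel, not the paper's DA-SpMM design: it exposes the loop over sparse rows, the loop over the nonzeros of each row, and the loop over dense output columns, which are the loops the three-loop model reorders and maps to GPU parallelism in different ways. All names (spmm_csr_naive, csr_row_ptr, csr_col_idx, csr_val) are illustrative assumptions, not identifiers from the paper or from cuSPARSE.

// Computes C[M x N] = A[M x K] * B[K x N], where A is sparse (CSR) and B, C are
// dense row-major matrices. One thread block handles one sparse row; threads in
// the block stride over the N dense output columns.
#include <cuda_runtime.h>

__global__ void spmm_csr_naive(int M, int N,
                               const int* __restrict__ csr_row_ptr,  // size M + 1
                               const int* __restrict__ csr_col_idx,  // size nnz
                               const float* __restrict__ csr_val,    // size nnz
                               const float* __restrict__ B,          // K x N, row-major
                               float* __restrict__ C)                // M x N, row-major
{
    // Loop 1 (parallelized across thread blocks): one sparse row per block.
    int row = blockIdx.x;
    if (row >= M) return;

    // Loop 3 (parallelized across threads): each thread covers some dense columns.
    for (int col = threadIdx.x; col < N; col += blockDim.x) {
        float acc = 0.0f;
        // Loop 2 (sequential here): walk the nonzeros of this sparse row.
        for (int p = csr_row_ptr[row]; p < csr_row_ptr[row + 1]; ++p) {
            acc += csr_val[p] * B[csr_col_idx[p] * N + col];
        }
        C[row * N + col] = acc;
    }
}

// Example launch: one block per sparse row, 128 threads striding over columns.
// spmm_csr_naive<<<M, 128>>>(M, N, d_row_ptr, d_col_idx, d_val, d_B, d_C);

In this baseline, the reduction over nonzeros stays private to one thread, so no thread race arises; designs that instead parallelize loop 2 across threads are the ones that need techniques like the conditional reduction the abstract mentions.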
