Paper Title
GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks
Paper Authors
Paper Abstract
Graph Neural Networks (GNNs) have achieved significant improvements in various domains. Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operator in GNNs, which performs a multiplication between a sparse matrix and a dense matrix. Accelerating SpMM on parallel hardware like GPUs faces the following challenges: From the GNN application perspective, compatibility needs to be considered. General GNN algorithms require SpMM-like operations (e.g., pooling) between matrices, which are not supported in current high-performance GPU libraries (e.g., Nvidia cuSPARSE). Moreover, the sophisticated preprocessing in previous implementations leads to heavy data-format conversion overheads in GNN frameworks. From the GPU hardware perspective, optimizations designed for SpMV (Sparse Matrix-Vector multiplication) on GPUs do not apply well to SpMM. SpMM exposes column-wise parallelism in the dense output matrix, but a straightforward generalization from SpMV leads to inefficient, uncoalesced access to the sparse matrix in global memory. Besides, the sparse row data can be reused among GPU threads, which is not possible in SpMM designs inherited from SpMV. To tackle these challenges, we propose GE-SpMM. GE-SpMM performs SpMM-like operations on sparse matrices represented in the most common Compressed Sparse Row (CSR) format, so it can be embedded in GNN frameworks with no preprocessing overhead and support general GNN algorithms. We introduce the Coalesced Row Caching method to process columns in parallel while ensuring coalesced access to sparse matrix data. We also present Coarse-grained Warp Merging to reduce redundant data loading among GPU warps. Experiments on real-world graph datasets show that GE-SpMM achieves up to 1.41X speedup over Nvidia cuSPARSE and up to 1.81X over GraphBLAST. We also embed GE-SpMM in GNN frameworks and achieve up to 3.67X speedup on popular GNN models like GCN and GraphSAGE.
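To make the Coalesced Row Caching idea from the abstract concrete, below is a minimal CUDA sketch (not the authors' released kernel): one warp handles one CSR row of the sparse matrix and 32 consecutive columns of the dense matrix, stages the row's non-zeros in shared memory with coalesced loads, and lets every lane reuse that cached data for its own output column. The kernel name, launch shape, and 32-element tile size are illustrative assumptions.

#include <cuda_runtime.h>

// Minimal sketch of SpMM with Coalesced Row Caching:
// C (num_rows x n_cols) = A (CSR) * B (dense, row-major).
__global__ void spmm_csr_crc(int num_rows, int n_cols,
                             const int   *__restrict__ row_ptr,  // CSR row pointers of A
                             const int   *__restrict__ col_idx,  // CSR column indices of A
                             const float *__restrict__ values,   // CSR values of A
                             const float *__restrict__ B,        // dense input, row-major
                             float       *__restrict__ C)        // dense output, row-major
{
    extern __shared__ char smem[];
    int   *sh_col = reinterpret_cast<int *>(smem);
    float *sh_val = reinterpret_cast<float *>(sh_col + blockDim.x);

    int lane   = threadIdx.x & 31;                        // lane id within the warp
    int warp   = threadIdx.x >> 5;                        // warp id within the block
    int row    = blockIdx.x * (blockDim.x >> 5) + warp;   // one warp per sparse row
    int col    = blockIdx.y * 32 + lane;                  // one dense column per lane
    int sh_off = warp * 32;                               // this warp's slice of shared memory

    if (row >= num_rows) return;

    int start = row_ptr[row], end = row_ptr[row + 1];
    float acc = 0.0f;

    // Walk the sparse row in tiles of 32 non-zeros.
    for (int tile = start; tile < end; tile += 32) {
        int nz = tile + lane;
        // Coalesced Row Caching: all 32 lanes load consecutive non-zeros.
        if (nz < end) {
            sh_col[sh_off + lane] = col_idx[nz];
            sh_val[sh_off + lane] = values[nz];
        }
        __syncwarp();
        // Every lane reuses the cached non-zeros for its own output column.
        if (col < n_cols) {
            int limit = min(32, end - tile);
            for (int i = 0; i < limit; i++)
                acc += sh_val[sh_off + i] * B[sh_col[sh_off + i] * n_cols + col];
        }
        __syncwarp();   // keep cached data valid until all lanes have consumed it
    }
    if (col < n_cols)
        C[row * n_cols + col] = acc;
}

A plausible launch would use, e.g., 128-thread blocks (4 warps), a grid of (ceil(num_rows / 4), ceil(n_cols / 32)), and blockDim.x * 8 bytes of dynamic shared memory. Coarse-grained Warp Merging would extend this sketch by letting each lane accumulate several output columns (e.g., 2 or 4), so the non-zeros cached by a warp are reused even further before being replaced.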