Paper Title


Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures

Authors

Chenhao Xie, Jieyang Chen, Jesun S. Firoz, Jiajia Li, Shuaiwen Leon Song, Kevin Barker, Mark Raugas, Ang Li

Abstract


Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task due to significant irregular memory references and workload imbalance across the GPUs. This is particularly the case for the Sparse Triangular Solver (SpTRSV), which introduces additional two-dimensional computation dependencies among subsequent computation steps. Dependency information is exchanged and shared among GPUs, thus warranting efficient memory allocation, data partitioning, and workload distribution, as well as fine-grained communication and synchronization support. In this work, we demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, despite the GPUs being linked via fast interconnects such as NVLink and NVSwitch. Instead, we employ the latest NVSHMEM technology, based on the Partitioned Global Address Space programming model, to enable efficient fine-grained communication and drastically reduce synchronization overhead. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model which can further enhance GPU utilization. By applying these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems demonstrate that our design achieves on average a 3.53x (up to 9.86x) speedup on a DGX-1 system and a 3.66x (up to 9.64x) speedup on a DGX-2 system with 4 GPUs over the unified-memory design. Comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV is able to fully utilize the computing and communication resources of the multi-GPU system.
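The "two-dimensional computation dependencies" the abstract refers to arise because row i of a lower-triangular solve Lx = b cannot start until every row j < i that it references has finished. A common way to expose the parallelism that SpTRSV implementations (single- or multi-GPU) exploit is level-set scheduling: rows in the same level are mutually independent and can be solved concurrently, while levels execute in order. The sketch below is an illustrative, sequential Python model of that dependency analysis and a forward-substitution solve, not the paper's NVSHMEM-based implementation; the dictionary-of-rows matrix format is an assumption made for brevity.

```python
# Illustrative level-set scheduling for a sparse lower-triangular solve Lx = b.
# This models only the dependency structure; a real multi-GPU SpTRSV would
# distribute the rows of each level across devices.

def build_levels(L):
    """L: dict mapping row index -> list of (col, value) pairs in the lower
    triangle, including the diagonal entry. Returns rows grouped into
    dependency levels: level(i) = 1 + max level of any row i depends on."""
    n = len(L)
    level = [0] * n
    for i in range(n):
        for j, _ in L[i]:
            if j < i:  # off-diagonal entry => row i depends on row j
                level[i] = max(level[i], level[j] + 1)
    groups = {}
    for i, lv in enumerate(level):
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]

def sptrsv(L, b):
    """Reference forward substitution; rows within one level are independent
    and could be solved in parallel."""
    n = len(L)
    x = [0.0] * n
    for rows in build_levels(L):
        for i in rows:
            s, diag = b[i], None
            for j, v in L[i]:
                if j == i:
                    diag = v
                else:
                    s -= v * x[j]  # x[j] was finalized in an earlier level
            x[i] = s / diag
    return x

# Small example: a 3x3 bidiagonal lower-triangular system.
L = {0: [(0, 2.0)], 1: [(0, 1.0), (1, 3.0)], 2: [(1, 1.0), (2, 4.0)]}
b = [2.0, 5.0, 7.0]
levels = build_levels(L)
x = sptrsv(L, b)
```

Because each row here depends on its predecessor, the example degenerates to one row per level; matrices with more scattered sparsity yield wide levels, which is what makes distributing a level's rows across GPUs worthwhile.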
