Paper Title


Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures

Authors

Chenhao Xie, Jieyang Chen, Jesun S. Firoz, Jiajia Li, Shuaiwen Leon Song, Kevin Barker, Mark Raugas, Ang Li

Abstract


Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task due to significant irregular memory references and workload imbalance across the GPUs. This is particularly the case for the Sparse Triangular Solver (SpTRSV), which introduces additional two-dimensional computation dependencies among subsequent computation steps. Dependency information is exchanged and shared among GPUs, thus warranting efficient memory allocation, data partitioning, and workload distribution, as well as fine-grained communication and synchronization support. In this work, we demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, despite the GPUs being linked via fast interconnects such as NVLink and NVSwitch. Instead, we employ the latest NVSHMEM technology, based on the Partitioned Global Address Space programming model, to enable efficient fine-grained communication and drastically reduce synchronization overhead. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model which can further enhance GPU utilization. By applying these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems demonstrate that our design achieves on average a 3.53x (up to 9.86x) speedup on a DGX-1 system and a 3.66x (up to 9.64x) speedup on a DGX-2 system with 4 GPUs over the unified-memory design. Comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV is able to fully utilize the computing and communication resources of the multi-GPU system.
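The "two-dimensional computation dependencies" the abstract refers to arise because row i of a lower-triangular solve Lx = b cannot start until every row j < i that it references has finished. A common way to expose the parallelism that SpTRSV implementations (single- or multi-GPU) exploit is level-set scheduling: rows in the same level are mutually independent and can be solved concurrently, while levels execute in order. The sketch below is an illustrative, sequential Python model of that dependency analysis and a forward-substitution solve, not the paper's NVSHMEM-based implementation; the dictionary-of-rows matrix format is an assumption made for brevity.

```python
# Illustrative level-set scheduling for a sparse lower-triangular solve Lx = b.
# This models only the dependency structure; a real multi-GPU SpTRSV would
# distribute the rows of each level across devices.

def build_levels(L):
    """L: dict mapping row index -> list of (col, value) pairs in the lower
    triangle, including the diagonal entry. Returns rows grouped into
    dependency levels: level(i) = 1 + max level of any row i depends on."""
    n = len(L)
    level = [0] * n
    for i in range(n):
        for j, _ in L[i]:
            if j < i:  # off-diagonal entry => row i depends on row j
                level[i] = max(level[i], level[j] + 1)
    groups = {}
    for i, lv in enumerate(level):
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]

def sptrsv(L, b):
    """Reference forward substitution; rows within one level are independent
    and could be solved in parallel."""
    n = len(L)
    x = [0.0] * n
    for rows in build_levels(L):
        for i in rows:
            s, diag = b[i], None
            for j, v in L[i]:
                if j == i:
                    diag = v
                else:
                    s -= v * x[j]  # x[j] was finalized in an earlier level
            x[i] = s / diag
    return x

# Small example: a 3x3 bidiagonal lower-triangular system.
L = {0: [(0, 2.0)], 1: [(0, 1.0), (1, 3.0)], 2: [(1, 1.0), (2, 4.0)]}
b = [2.0, 5.0, 7.0]
levels = build_levels(L)
x = sptrsv(L, b)
```

Because each row here depends on its predecessor, the example degenerates to one row per level; matrices with more scattered sparsity yield wide levels, which is what makes distributing a level's rows across GPUs worthwhile.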
