Paper Title

Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale

Authors

Md Taufique Hussain, Oguz Selvitopi, Aydin Buluç, Ariful Azad

Abstract

Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in various graph, scientific computing and machine learning algorithms. In this paper, we consider SpGEMMs performed on hundreds of thousands of processors generating trillions of nonzeros in the output matrix. Distributed SpGEMM at this extreme scale faces two key challenges: (1) high communication cost and (2) inadequate memory to generate the output. We address these challenges with an integrated communication-avoiding and memory-constrained SpGEMM algorithm that scales to 262,144 cores (more than 1 million hardware threads) and can multiply sparse matrices of any size as long as inputs and a fraction of output fit in the aggregated memory. As we go from 16,384 cores to 262,144 cores on a Cray XC40 supercomputer, the new SpGEMM algorithm runs 10x faster when multiplying large-scale protein-similarity matrices.
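The memory-constrained idea in the abstract (computing the product even when only a fraction of the output fits in aggregated memory) can be illustrated with a minimal single-node sketch: instead of forming C = A·B all at once, compute it in column batches of B so that only one slab of C is materialized at a time. This is a hypothetical illustration, not the authors' distributed implementation; the function name `batched_spgemm` and the batching scheme are assumptions for demonstration.

```python
# Hypothetical sketch (not the paper's code): batched SpGEMM that forms
# C = A @ B one vertical slab at a time, so only a fraction of the output
# is held in working memory per batch -- the memory-constrained idea.
import numpy as np
import scipy.sparse as sp


def batched_spgemm(A, B, n_batches=4):
    """Compute C = A @ B using n_batches column batches of B."""
    n_cols = B.shape[1]
    bounds = np.linspace(0, n_cols, n_batches + 1, dtype=int)
    slabs = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # Each batch multiplies A by a column slice of B; in the
        # distributed setting each finished slab would be written out
        # (or consumed) before the next batch begins.
        slabs.append(A @ B[:, lo:hi])
    return sp.hstack(slabs).tocsr()
```

In the actual extreme-scale setting the same principle is combined with communication-avoiding (3D/2.5D-style) distribution across hundreds of thousands of cores; the sketch only shows why batching bounds peak output memory.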
