加速基于CPU的稀疏一般矩阵乘法与二进制行合并

论文标题

加速基于CPU的稀疏一般矩阵乘法与二进制行合并

Accelerating CPU-Based Sparse General Matrix Multiplication With Binary Row Merging

论文作者

Du, Zhaoyang, Guan, Yijin, Guan, Tianchan, Niu, Dimin, Zheng, Hongzhong, Xie, Yuan

论文摘要

稀疏的一般矩阵乘法（SPGEMM）是许多现实世界应用的基本构建块。由于SPGEMM是众所周知的内存应用程序，具有庞大且不规则的内存访问，因此考虑到内存访问效率对于SPGEMM的性能至关重要。但是，现有方法对内存子系统的考虑较少，并实现了次优性能。在本文中，我们彻底分析了SPGEMM的内存访问模式及其对内存子系统的影响。基于分析，我们为多核CPU体系结构提出了一种名为BRMERGE的新颖，更有效的积累方法。 BRMERGE累积方法遵循行明数据流。它首先访问$ b $矩阵，生成一个输出行的中间列表，并将这些中间列表存储在连续的内存空间中，该列表由ping-pong-pong buffer实现。然后，它立即将上一阶段第二阶段产生的这些中间列表合并在两个乒乓缓冲区之间的树状层次结构中。 BRMERGE的架构优势是1）流式访问模式，2）最小化TLB缓存率和3）相当高的L1/L2高速缓存命中率相当高，这会导致低访问延迟和执行SPGEMM时的较高的带宽利用率。基于BRMERGE累积方法，我们建议使用使用不同分配方法的两个SPGEMM库，名为Brmerge-upper和Brmerge-precise。在两个CPU服务器上使用26个常用基准测试的性能评估表明，所提出的SPGEMM库大大优于最先进的SPGEMM库。

Sparse general matrix multiplication (SpGEMM) is a fundamental building block for many real-world applications. Since SpGEMM is a well-known memory-bounded application with vast and irregular memory accesses, considering the memory access efficiency is of critical importance for SpGEMM's performance. Yet, the existing methods put less consideration into the memory subsystem and achieved suboptimal performance. In this paper, we thoroughly analyze the memory access patterns of SpGEMM and their influences on the memory subsystem. Based on the analysis, we propose a novel and more efficient accumulation method named BRMerge for the multi-core CPU architectures. The BRMerge accumulation method follows the row-wise dataflow. It first accesses the $B$ matrix, generates the intermediate lists for one output row, and stores these intermediate lists in a consecutive memory space, which is implemented by a ping-pong buffer. It then immediately merges these intermediate lists generated in the previous phase two by two in a tree-like hierarchy between two ping-pong buffers. The architectural benefits of BRMerge are 1) streaming access patterns, 2) minimized TLB cache miss rate, and 3) reasonably high L1/L2 cache hit rates, which result in both low access latency and high bandwidth utilization when performing SpGEMM. Based on the BRMerge accumulation method, we propose two SpGEMM libraries named BRMerge-Upper and BRMerge-Precise, which use different allocation methods. Performance evaluations with 26 commonly used benchmarks on two CPU servers show that the proposed SpGEMM libraries significantly outperform the state-of-the-art SpGEMM libraries.

下载PDF全文

下载文献需遵守相关版权规定

论文标题