FSHMEM: Supporting Partitioned Global Address Space on FPGAs for Large-Scale Hardware Acceleration Infrastructure

Paper Title

FSHMEM: Supporting Partitioned Global Address Space on FPGAs for Large-Scale Hardware Acceleration Infrastructure

Authors

Yashael Faith Arthanto, David Ojika, Joo-Young Kim

Abstract

By providing highly efficient one-sided communication over a globally shared memory space, Partitioned Global Address Space (PGAS) has become one of the most promising parallel computing models in high-performance computing (HPC). Meanwhile, FPGAs are gaining attention as an alternative compute platform for HPC systems, thanks to custom computing and design flexibility. However, unlike the traditional message passing interface, PGAS has not yet been explored on FPGAs. This paper proposes FSHMEM, a software/hardware framework that enables the PGAS programming model on FPGAs. We implement the core functions of the GASNet specification on the FPGA for native PGAS integration in hardware, while the programming interface is designed to be highly compatible with legacy software. Our experiments show that FSHMEM achieves a peak bandwidth of 3813 MB/s, which is more than 95% of the theoretical maximum and outperforms prior work by 9.5$\times$. It records latencies of 0.35 $\mu$s and 0.59 $\mu$s for remote write and read operations, respectively. Finally, we conduct a case study on two Intel D5005 FPGA nodes integrating Intel's deep learning accelerator. The two-node system programmed with FSHMEM achieves 1.94$\times$ and 1.98$\times$ speedups for matrix multiplication and convolution, respectively, showing its scalability potential for HPC infrastructure.
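To put the reported figures in context, the short Python sketch below derives two quantities that the abstract implies but does not state: the parallel efficiency of the two-node case study (speedup divided by node count) and an upper bound on the theoretical maximum bandwidth implied by the "more than 95%" claim. The function and constant names are illustrative assumptions, not part of FSHMEM or GASNet, and the only inputs are the numbers quoted in the abstract.

```python
# Figures quoted in the abstract; all names below are illustrative.
PEAK_BANDWIDTH_MB_S = 3813.0   # reported peak bandwidth
MIN_FRACTION_OF_MAX = 0.95     # "more than 95% of the theoretical maximum"
NODES = 2                      # two Intel D5005 FPGA nodes
SPEEDUPS = {"matrix multiplication": 1.94, "convolution": 1.98}


def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of ideal linear scaling: speedup divided by node count."""
    return speedup / nodes


def theoretical_max_upper_bound(peak: float, min_fraction: float) -> float:
    """If peak > min_fraction * maximum, then maximum < peak / min_fraction."""
    return peak / min_fraction


efficiencies = {
    name: parallel_efficiency(speedup, NODES)
    for name, speedup in SPEEDUPS.items()
}
bound = theoretical_max_upper_bound(PEAK_BANDWIDTH_MB_S, MIN_FRACTION_OF_MAX)

for name, eff in efficiencies.items():
    print(f"{name}: {eff:.0%} parallel efficiency on {NODES} nodes")
print(f"theoretical maximum bandwidth is below {bound:.1f} MB/s")
```

Under these assumptions the two workloads reach 97% and 99% of ideal two-node scaling, and the theoretical maximum bandwidth must be below roughly 4014 MB/s. The abstract does not give the exact theoretical maximum, so only this bound can be inferred.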
