Paper title
Fast Arbitrary Precision Floating Point on FPGA
Paper authors
Paper abstract
Numerical codes that require arbitrary precision floating point (APFP) numbers for their core computation are dominated by elementary arithmetic operations due to the super-linear complexity of multiplication in the number of mantissa bits. APFP computations on conventional software-based architectures are made exceedingly expensive by the lack of native hardware support, requiring elementary operations to be emulated using instructions operating on machine-word-sized blocks. In this work, we show how APFP multiplication on compile-time fixed-precision operands can be implemented as deep FPGA pipelines with a recursively defined Karatsuba decomposition on top of native DSP multiplication. When comparing our design implemented on an Alveo U250 accelerator to a dual-socket 36-core Xeon node running the GNU Multiple Precision Floating-Point Reliable (MPFR) library, we achieve a 9.8x speedup at 4.8 GOp/s for 512-bit multiplication, and a 5.3x speedup at 1.2 GOp/s for 1024-bit multiplication, corresponding to the throughput of more than 351x and 191x CPU cores, respectively. We apply this architecture to general matrix-matrix multiplication, yielding a 10x speedup at 2.0 GOp/s over the Xeon node, equivalent to more than 375x CPU cores, effectively allowing a single FPGA to replace a small CPU cluster. Due to the significant dependence of some numerical codes on APFP, such as semidefinite program solvers, we expect these gains to translate into real-world speedups. Our configurable and flexible HLS-based code provides a high-level software interface for plug-and-play acceleration, and is published as an open-source project.
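The recursively defined Karatsuba decomposition mentioned in the abstract can be illustrated with a minimal software sketch. This is not the authors' HLS implementation: the function name `karatsuba`, the parameter `base_bits`, and the 16-bit base-case width are assumptions chosen for illustration, with the base case standing in for a native DSP multiplier. The sketch only covers the integer mantissa product; sign, exponent, and rounding handling are omitted.

```python
def karatsuba(a: int, b: int, bits: int, base_bits: int = 16) -> int:
    """Multiply two non-negative integers of at most `bits` bits each.

    Hypothetical illustration: `base_bits` is an assumed width at which
    the recursion bottoms out in a "native" multiplication.
    """
    assert base_bits >= 3, "smaller base widths would not terminate"
    if bits <= base_bits:
        # Base case: stands in for a native hardware multiplier.
        return a * b

    half = (bits + 1) // 2          # width of the low part
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half

    # Three sub-multiplications instead of the four of the schoolbook method.
    lo = karatsuba(a_lo, b_lo, half, base_bits)
    hi = karatsuba(a_hi, b_hi, bits - half, base_bits)
    # The sums of the halves can be one bit wider than `half`.
    mid = karatsuba(a_lo + a_hi, b_lo + b_hi, half + 1, base_bits)

    # (a_lo + a_hi)(b_lo + b_hi) - lo - hi = a_lo*b_hi + a_hi*b_lo
    cross = mid - lo - hi
    return (hi << (2 * half)) + (cross << half) + lo


if __name__ == "__main__":
    x = (1 << 512) - 1
    y = (1 << 511) + 12345
    print(karatsuba(x, y, 512) == x * y)
```

Because the operand width is passed explicitly and halved at each level, the recursion depth is fixed once `bits` is known. This mirrors, in spirit, the compile-time fixed precision described above, where the whole recursion could be unrolled into a pipeline.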