Particle Mesh Ewald for Molecular Dynamics in OpenCL on an FPGA Cluster

Paper Title

Particle Mesh Ewald for Molecular Dynamics in OpenCL on an FPGA Cluster

Authors

Stewart, Lawrence C., Pascoe, Carlo, Sherman, Brian W., Herbordt, Martin, Sachdeva, Vipin

Abstract

Molecular Dynamics (MD) simulations play a central role in physics-driven drug discovery. MD applications often use the Particle Mesh Ewald (PME) algorithm to accelerate electrostatic force computations, but efficient parallelization has proven difficult due to the high communication requirements of distributed 3D FFTs. In this paper, we present the design and implementation of a scalable PME algorithm that runs on a cluster of Intel Stratix 10 FPGAs and can handle FFT sizes appropriate to address real-world drug discovery projects (grids up to $128^3$). To our knowledge, this is the first work to fully integrate all aspects of the PME algorithm (charge spreading, 3D FFT/IFFT, and force interpolation) within a distributed FPGA framework. The design is fully implemented with OpenCL for flexibility and ease of development and uses 100 Gbps links for direct FPGA-to-FPGA communications without the need for host interaction. We present experimental data up to 4 FPGAs (e.g., 206 microseconds per timestep for a 65536 atom simulation and $64^3$ 3D FFT), outperforming GPUs. Additionally, we discuss design scalability on clusters with differing topologies up to 64 FPGAs (with expected performance greater than all known GPU implementations) and integration with other hardware components to form a complete molecular dynamics application. We predict best-case performance of 6.6 microseconds per timestep on 64 FPGAs.
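The three PME stages named in the abstract (charge spreading, 3D FFT/IFFT, and force interpolation) can be illustrated with a minimal single-node sketch in NumPy. This is not the authors' OpenCL design: it assumes linear (cloud-in-cell) weights in place of the B-splines that PME normally uses, a cubic periodic box in Gaussian units, and only the reciprocal-space part of the Ewald sum. All function names and the values of `n`, `box`, and `beta` are hypothetical choices made for illustration.

```python
import numpy as np

def cic_weights(pos, n, box):
    """Yield (grid indices, weights) for the 8 cell corners around each particle.

    Assumption: linear cloud-in-cell weights, not the B-splines used by real PME.
    """
    u = pos / box * n                      # positions in grid units
    i0 = np.floor(u).astype(int)           # lower corner of the enclosing cell
    f = u - i0                             # fractional offset inside the cell
    for corner in np.ndindex(2, 2, 2):
        c = np.array(corner)
        w = np.prod(np.where(c == 1, f, 1.0 - f), axis=1)
        idx = (i0 + c) % n                 # periodic wrap
        yield (idx[:, 0], idx[:, 1], idx[:, 2]), w

def spread_charges(pos, q, n, box):
    """Stage 1: spread point charges onto an n^3 periodic grid."""
    grid = np.zeros((n, n, n))
    for idx, w in cic_weights(pos, n, box):
        np.add.at(grid, idx, q * w)        # unbuffered add handles repeated indices
    return grid

def reciprocal_field(grid, box, beta):
    """Stage 2: forward 3D FFT, Gaussian-screened Poisson solve, inverse FFTs."""
    n = grid.shape[0]
    h = box / n
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=h)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                      # placeholder to avoid division by zero
    green = 4.0 * np.pi / k2 * np.exp(-k2 / (4.0 * beta**2))
    green[0, 0, 0] = 0.0                   # drop the k = 0 (net charge) term
    phi_k = green * np.fft.fftn(grid / h**3)
    # E = -grad(phi), i.e. multiplication by -i*k in Fourier space
    return [np.fft.ifftn(-1j * kk * phi_k).real for kk in (kx, ky, kz)]

def interpolate_forces(pos, q, field, n, box):
    """Stage 3: gather the grid field at each particle with the same weights."""
    forces = np.zeros_like(pos)
    for idx, w in cic_weights(pos, n, box):
        for axis in range(3):
            forces[:, axis] += q * w * field[axis][idx]
    return forces

# Toy system: 64 particles with alternating unit charges (illustrative values only)
rng = np.random.default_rng(0)
n, box, beta = 16, 1.0, 5.0
pos = rng.random((64, 3)) * box
q = np.tile([1.0, -1.0], 32)

grid = spread_charges(pos, q, n, box)
field = reciprocal_field(grid, box, beta)
forces = interpolate_forces(pos, q, field, n, box)
```

In a distributed setting such as the one the abstract describes, the grid is partitioned across devices, so the 3D FFT in the second stage requires exchanging grid data between them. That exchange is the communication cost the abstract identifies as the obstacle to efficient parallelization.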
