Paper title
Twisted mass ensemble generation on GPU machines
Paper authors
Paper abstract
We present how we ported the Hybrid Monte Carlo implementation in the tmLQCD software suite to GPUs through offloading its most expensive parts to the QUDA library. We discuss our motivations and some of the technical challenges that we encountered as we added the required functionality to both tmLQCD and QUDA. We further present some performance details, focussing in particular on the usage of QUDA's multigrid solver for poorly conditioned light quark monomials as well as the multi-shift solver for the non-degenerate strange and charm sector in $N_f=2+1+1$ simulations using twisted mass clover fermions, comparing the efficiency of state-of-the-art simulations on CPU and GPU machines. We also take a look at the performance-portability question through preliminary tests of our HMC on a machine based on AMD's MI250 GPU, finding good performance after a very minor additional porting effort. Finally, we conclude that we should be able to achieve GPU utilisation factors acceptable for the current generation of (pre-)exascale supercomputers with substantial efficiency improvements and real time speedups compared to just running on CPUs. At the same time, we find that future challenges will require different approaches and, most importantly, a very significant investment of personnel for software development.