Paper Title
Deploying Scientific AI Networks at Petaflop Scale on Secure Large Scale HPC Production Systems with Containers
Paper Authors
Paper Abstract
There is an ever-increasing need for computational power to train complex artificial intelligence (AI) and machine learning (ML) models to tackle large scientific problems. High performance computing (HPC) resources are required to efficiently compute and scale complex models across tens of thousands of compute nodes. In this paper, we discuss the issues associated with deploying machine learning frameworks on large-scale secure HPC systems, and how we successfully deployed a standard machine learning framework on a secure large-scale HPC production system to train a complex three-dimensional convolutional GAN (3DGAN) at petaflop performance. 3DGAN is an example from the high energy physics domain, designed to simulate the energy pattern produced by showers of secondary particles inside a particle detector on various HPC systems.