Paper Title

Recovery of Distributed Iterative Solvers for Linear Systems Using Non-Volatile RAM

Authors

Yehonatan Fridman, Yaniv Snir, Harel Levin, Danny Hendler, Hagit Attiya, Gal Oren

Abstract

HPC systems are a critical resource for scientific research. The increased demand for computational power and memory ushers in the exascale era, in which supercomputers are designed to provide enormous computing power to meet these needs. These complex supercomputers consist of numerous compute nodes and are consequently expected to experience frequent faults and crashes. Mathematical solvers, and iterative linear solvers in particular, are key building blocks in numerous large-scale scientific applications. Consequently, supporting the recovery of distributed solvers is necessary for scaling scientific applications to exascale platforms. Previous recovery methods for iterative solvers are based either on Checkpoint-Restart (CR), which incurs high fault tolerance overhead, or on intrinsic fault tolerance, which requires extra computation time to converge after failures. Exact state reconstruction (ESR) was proposed as an alternative mechanism to alleviate the impact of frequent failures on long-term computations. ESR has been shown to provide exact reconstruction of the computation state while avoiding the need for costly checkpointing. However, ESR currently relies on volatile memory for fault tolerance, and must therefore maintain redundancies in the RAM of multiple nodes, incurring high memory and network overheads. Recent supercomputer designs feature emerging non-volatile RAM (NVRAM) technology. This paper investigates how NVRAM can be utilized to devise an enhanced ESR-based recovery mechanism that is more efficient and provides full resilience. Our mechanism, called in-NVRAM ESR, is based on a novel MPI One-Sided Communication (OSC) over RDMA implementation, and provides full resilience while significantly reducing both the memory footprint and the time overhead in comparison with the original ESR design (in-RAM ESR).
