Paper Title

LCFI: A Fault Injection Tool for Studying Lossy Compression Error Propagation in HPC Programs

Authors

Baodi Shan, Aabid Shamji, Jiannan Tian, Guanpeng Li, Dingwen Tao

Abstract

Error-bounded lossy compression is becoming increasingly important to today's extreme-scale HPC applications because of the ever-increasing volume of data they generate. It has been widely used for in-situ visualization, data stream intensity reduction, storage reduction, I/O performance improvement, checkpoint/restart acceleration, memory footprint reduction, and similar purposes. Although many works have optimized the ratio, quality, and performance of different error-bounded lossy compressors, no existing work attempts to systematically understand how lossy compression errors affect HPC applications through error propagation. In this paper, we propose and develop a lossy compression fault injection tool called LCFI. To the best of our knowledge, this is the first fault injection tool that helps both lossy compressor developers and users systematically and comprehensively understand the impact of lossy compression errors on HPC programs. The contributions of this work are threefold: (1) We propose an efficient approach to injecting lossy compression errors, based on a statistical analysis of the compression errors of different state-of-the-art compressors. (2) We build a fault injector that is highly applicable, customizable, and easy to use for generating top-down comprehensive results, and we demonstrate the use of LCFI. (3) We evaluate LCFI on four representative HPC benchmarks with different abstracted fault models and make several observations about error propagation and its impact on program outputs.
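To make the idea of injecting error-bounded compression errors concrete, the following is a minimal sketch and not LCFI's actual implementation. It assumes one hypothetical fault model in which each value is perturbed by an error drawn uniformly from [-error_bound, +error_bound]. The abstract does not specify the distributions LCFI uses, and the names `inject_bounded_errors` and `max_abs_error` are illustrative only.

```python
import random


def inject_bounded_errors(values, error_bound, seed=None):
    """Return a copy of `values` with a simulated compression error added.

    Assumption (not taken from the paper): each error is drawn uniformly
    from [-error_bound, +error_bound], mimicking an absolute error bound.
    """
    if error_bound < 0:
        raise ValueError("error_bound must be non-negative")
    rng = random.Random(seed)  # seeded so an injection run can be repeated
    return [v + rng.uniform(-error_bound, error_bound) for v in values]


def max_abs_error(original, perturbed):
    """Largest pointwise deviation between the two sequences."""
    return max(abs(a - b) for a, b in zip(original, perturbed))


if __name__ == "__main__":
    data = [0.5 * i for i in range(10)]
    noisy = inject_bounded_errors(data, error_bound=1e-3, seed=42)
    # The deviation stays within the bound, up to floating-point rounding.
    print(max_abs_error(data, noisy) <= 1e-3 + 1e-12)
```

Feeding the perturbed values back into a program in place of the originals, then comparing the program's outputs with and without injection, is one simple way to observe how such bounded errors propagate.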
