Paper Title
Implementing Software Resiliency in HPX for Extreme Scale Computing
Authors
Abstract
Exceptions and errors that occur in mission-critical applications because of hardware failures carry a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will invariably increase. Designing applications to be resilient is therefore a critical concern if results are to remain reliable while power-budget constraints are met. In this paper, we implement software resilience in HPX, an Asynchronous Many-Task runtime system. We implement two resiliency APIs that we expose to application developers: task replication and task replay. Task replication launches n copies of a task and executes them asynchronously. Task replay reschedules a task up to n times until it returns a valid output. Furthermore, we introduce an API that allows the application to verify the returned result with a user-provided predicate. We test the APIs with both artificial workloads and a dataflow-based stencil application. We demonstrate that only minor overheads are incurred when using these resiliency features for workloads where the task size is greater than 200 $\mu$s. We also show that most of the added execution time arises from the replay or replication of the tasks themselves and not from the implementation of the APIs.
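To make the two strategies described in the abstract concrete, the following is a minimal sketch written against the C++ standard library only. It is not the HPX API and not the authors' implementation: the names `replay_validate` and `replicate_validate`, their signatures, and the choice to treat a thrown exception as a failed attempt are all assumptions made for illustration. HPX schedules lightweight tasks and returns futures, whereas this sketch uses `std::async` and returns the value directly.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not the HPX API): task replay.
// Runs `task` up to `n` times, one attempt after another, and returns the
// first result that the user-provided predicate `valid` accepts.
// Assumption: an attempt that throws counts as a failed attempt.
template <typename Pred, typename F>
auto replay_validate(std::size_t n, Pred valid, F task) -> decltype(task())
{
    assert(n > 0);
    for (std::size_t attempt = 0; attempt < n; ++attempt) {
        try {
            auto result = std::async(std::launch::async, task).get();
            if (valid(result))
                return result;   // valid output: stop replaying
        } catch (...) {
            // failed attempt: fall through and reschedule the task
        }
    }
    throw std::runtime_error("replay: no valid result after n attempts");
}

// Hypothetical helper (not the HPX API): task replication.
// Launches `n` copies of `task` asynchronously and returns the first
// result (in launch order) that `valid` accepts.
template <typename Pred, typename F>
auto replicate_validate(std::size_t n, Pred valid, F task) -> decltype(task())
{
    assert(n > 0);
    using R = decltype(task());
    std::vector<std::future<R>> replicas;
    replicas.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        replicas.push_back(std::async(std::launch::async, task));

    for (auto& replica : replicas) {
        try {
            R result = replica.get();
            if (valid(result))
                return result;   // remaining replicas are joined when
                                 // `replicas` is destroyed
        } catch (...) {
            // this replica failed: try the next one
        }
    }
    throw std::runtime_error("replicate: no replica produced a valid result");
}
```

In this sketch, replay pays for extra executions only when an attempt fails, while replication always pays for n executions but does not serialize the retries. Both helpers throw `std::runtime_error` when no attempt passes validation, which is one possible way to report an unrecoverable failure to the caller.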