Making Convolutions Resilient via Algorithm-Based Error Detection Techniques

Paper Title

Making Convolutions Resilient via Algorithm-Based Error Detection Techniques

Authors

Siva Kumar Sastry Hari, Michael B. Sullivan, Timothy Tsai, Stephen W. Keckler

Abstract

The ability of Convolutional Neural Networks (CNNs) to accurately process real-time telemetry has boosted their use in safety-critical and high-performance computing systems. As such systems require high levels of resilience to errors, CNNs must execute correctly in the presence of hardware faults. Full duplication provides the needed assurance but incurs a prohibitive 100% overhead. Algorithmic techniques are known to offer low-cost solutions, but the practical feasibility and performance of such techniques have never been studied for CNN deployment platforms (e.g., TensorFlow or TensorRT on GPUs). In this paper, we focus on algorithmically verifying Convolutions, which are the most resource-demanding operations in CNNs. We use checksums to verify convolutions, adding a small amount of redundancy, far less than full-duplication. We first identify the challenges that arise in employing Algorithm-Based Error Detection (ABED) for Convolutions in optimized inference platforms that fuse multiple network layers and use reduced-precision operations, and demonstrate how to overcome them. We propose and evaluate variations of ABED techniques that offer implementation complexity, runtime overhead, and coverage trade-offs. Results show that ABED can detect all transient hardware errors that might otherwise corrupt output and does so while incurring low runtime overheads (6-23%), offering at least 1.6X throughput to workloads compared to full duplication.
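The abstract does not say which checksums the authors use, so the following is only a minimal sketch of one well-known form of checksum-based verification that relies on the linearity of convolution: the element-wise sum of all output channels must equal the convolution of the input with the sum of all filters. The function names (`conv2d`, `filter_checksum`, `checked_conv`) and the `fault` parameter are hypothetical and chosen for illustration; this is not the paper's implementation. Integer arithmetic is assumed so that the comparison is exact; a floating-point version would need a tolerance.

```python
def conv2d(x, w):
    """Valid 2D cross-correlation of a single-channel input x with one filter w."""
    H, W = len(x), len(x[0])
    R, S = len(w), len(w[0])
    return [[sum(x[i + r][j + s] * w[r][s] for r in range(R) for s in range(S))
             for j in range(W - S + 1)]
            for i in range(H - R + 1)]


def filter_checksum(filters):
    """Element-wise sum of all filters, used as one extra 'checksum filter'."""
    R, S = len(filters[0]), len(filters[0][0])
    return [[sum(f[r][s] for f in filters) for s in range(S)] for r in range(R)]


def checked_conv(x, filters, fault=None):
    """Run every filter, then verify the outputs against the checksum filter.

    fault is a hypothetical hook (k, i, j, delta) that corrupts one output
    element to emulate a transient hardware error.
    Returns (outputs, ok), where ok is False if a mismatch was detected.
    """
    outputs = [conv2d(x, f) for f in filters]
    if fault is not None:
        k, i, j, delta = fault
        outputs[k][i][j] += delta

    # One extra convolution instead of recomputing every filter.
    expected = conv2d(x, filter_checksum(filters))
    P, Q = len(expected), len(expected[0])
    observed = [[sum(o[i][j] for o in outputs) for j in range(Q)] for i in range(P)]
    return outputs, observed == expected


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
filters = [[[1, 0], [0, -1]], [[2, 1], [1, 0]]]

_, ok_clean = checked_conv(x, filters)
_, ok_faulty = checked_conv(x, filters, fault=(1, 0, 0, 4))
print(ok_clean, ok_faulty)
```

In this sketch the check costs one additional convolution plus a reduction over output channels, regardless of how many filters there are, which illustrates why such a scheme can add far less redundancy than full duplication.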
