Yield Loss Reduction and Test of AI and Deep Learning Accelerators

Paper Title

Yield Loss Reduction and Test of AI and Deep Learning Accelerators

Authors

Mehdi Sadi, Ujjwal Guin

Abstract

With data-driven analytics becoming mainstream, the global demand for dedicated AI and deep learning accelerator chips is soaring. These accelerators, designed with densely packed processing elements (PEs), are especially vulnerable to the manufacturing defects and functional faults common in advanced semiconductor process nodes, resulting in significant yield loss. In this work, we demonstrate an application-driven methodology for binning AI accelerator chips and reducing yield loss by correlating the circuit faults in the PEs of the accelerator with the desired accuracy of the target AI workload. We exploit the inherent fault tolerance of trained deep learning models and a strategy of selectively deactivating faulty PEs to develop the presented yield loss reduction and test methodology. An analytical relationship is derived between fault location, fault rate, and the AI task's accuracy for deciding whether the accelerator chip can pass the final yield test. A yield-loss-reduction-aware fault isolation, ATPG, and test flow is presented for the multiply-and-accumulate units of the PEs. Results obtained with widely used AI/deep learning benchmarks demonstrate that the accelerators can sustain a 5% fault rate in the PE arrays while suffering less than 1% accuracy loss, thus enabling product binning and yield loss reduction for these chips.
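
The two ideas named in the abstract, deactivating faulty PEs and binning chips by the resulting accuracy loss, can be illustrated with a small sketch. It is not the authors' implementation. It assumes a weight-stationary array in which PE (r, c) holds one weight and a deactivated PE contributes zero to its row's sum. The function names, the bin labels, and the 1% default threshold (borrowed from the figure quoted in the abstract) are hypothetical choices made for illustration.

```python
import random


def faulty_pe_mask(rows, cols, fault_rate, seed=0):
    """Return a rows x cols grid of booleans; True marks a PE treated as faulty.

    Fault positions are drawn uniformly at random, which is an assumption of
    this sketch rather than a fault model taken from the paper.
    """
    rng = random.Random(seed)
    n_faulty = round(fault_rate * rows * cols)
    faulty = set(rng.sample(range(rows * cols), n_faulty))
    return [[(r * cols + c) in faulty for c in range(cols)] for r in range(rows)]


def matvec_with_deactivated_pes(weights, x, mask):
    """Matrix-vector product where each deactivated PE contributes 0."""
    return [
        sum(0.0 if mask[r][c] else weights[r][c] * x[c] for c in range(len(x)))
        for r in range(len(weights))
    ]


def assign_bin(fault_rate, accuracy_loss, max_accuracy_loss=0.01):
    """Hypothetical binning rule: fault-free, usable with faults, or rejected."""
    if fault_rate == 0.0:
        return "full"
    if accuracy_loss <= max_accuracy_loss:
        return "degraded"
    return "reject"
```

In such a flow, `accuracy_loss` would be measured by running the target workload on the masked array and comparing against the fault-free accuracy; the sketch leaves that measurement to the caller.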
