Paper Title
A Heterogeneous In-Memory Computing Cluster For Flexible End-to-End Inference of Real-World Deep Neural Networks
Paper Authors
Paper Abstract
Deployment of modern TinyML tasks on small, battery-constrained IoT devices requires high computational energy efficiency. Analog In-Memory Computing (IMC) using non-volatile memory (NVM) promises major efficiency improvements in deep neural network (DNN) inference and serves as on-chip memory storage for DNN weights. However, IMC's functional flexibility limitations and their impact on performance, energy, and area efficiency are not yet fully understood at the system level. To target practical end-to-end IoT applications, IMC arrays must be enclosed in heterogeneous programmable systems, introducing new system-level challenges which we aim to address in this work. We present a heterogeneous, tightly coupled clustered architecture integrating 8 RISC-V cores, an in-memory computing accelerator (IMA), and digital accelerators. We benchmark the system on a highly heterogeneous workload, the Bottleneck layer of MobileNetV2, showing an 11.5x performance improvement and a 9.5x energy efficiency improvement compared to highly optimized parallel execution on the cores. Furthermore, we explore the IMC array resources required for end-to-end inference of a full mobile-grade DNN (MobileNetV2) by scaling up our heterogeneous architecture to a multi-array accelerator. Our results show that, on end-to-end inference of MobileNetV2, our solution achieves one order of magnitude lower execution latency than existing programmable architectures and two orders of magnitude lower than state-of-the-art heterogeneous solutions integrating analog in-memory computing cores.
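For context, the Bottleneck layer referenced in the abstract is the standard MobileNetV2 inverted-residual block. The sketch below, a minimal PyTorch rendition rather than the paper's implementation (class and parameter names are illustrative), shows its structure and why the workload is heterogeneous: the two 1x1 pointwise convolutions are dense matrix operations that map naturally onto IMC crossbar arrays, while the 3x3 depthwise stage has low arithmetic intensity and is better served by digital accelerators or the RISC-V cores.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """MobileNetV2 inverted-residual (Bottleneck) block:
    1x1 expansion conv -> 3x3 depthwise conv -> 1x1 projection conv,
    with a residual connection when stride == 1 and channels match."""
    def __init__(self, in_ch, out_ch, stride=1, expansion=6):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # 1x1 pointwise expansion: a dense matrix workload,
            # the kind of layer that maps well onto an analog IMC crossbar
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise conv (groups == channels): low reuse per weight,
            # a poor fit for IMC arrays, hence offloaded to digital units
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 pointwise projection: again a dense, IMC-friendly layer
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```

Mixing these IMC-friendly and IMC-unfriendly stages within a single block is what makes tight coupling between the IMC accelerator, digital accelerators, and cores essential for end-to-end execution.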