Paper Title

Memory-Latency-Accuracy Trade-offs for Continual Learning on a RISC-V Extreme-Edge Node

Authors

Leonardo Ravaglia, Manuele Rusci, Alessandro Capotondi, Francesco Conti, Lorenzo Pellegrini, Vincenzo Lomonaco, Davide Maltoni, Luca Benini

Abstract

AI-powered edge devices currently lack the ability to adapt their embedded inference models to an ever-changing environment. To tackle this issue, Continual Learning (CL) strategies aim at incrementally improving the decision capabilities based on newly acquired data. In this work, after quantifying the memory and computational requirements of CL algorithms, we define a novel HW/SW extreme-edge platform featuring a low-power RISC-V octa-core cluster tailored for on-demand incremental learning over locally sensed data. The presented multi-core HW/SW architecture achieves 2.21 and 1.70 MAC/cycle, respectively, when running the forward and backward steps of gradient descent. We report the trade-offs between memory footprint, latency, and accuracy for learning a new class with Latent Replay CL, targeting an image classification task on the CORe50 dataset. Compared with a CL setting that retrains all the layers, which takes 5 h to learn a new class and achieves up to 77.3% accuracy, a more efficient solution retrains only part of the network, reaching 72.5% accuracy with a memory requirement of 300 MB and a computation latency of 1.5 hours. At the other extreme, retraining only the last layer yields the fastest (867 ms) and least memory-hungry (20 MB) solution, but scores only 58% on the CORe50 dataset. Thanks to the parallelism of the low-power cluster engine, our HW/SW platform is 25x faster than a typical MCU device, on which CL is still impractical, and achieves an 11x gain in energy consumption with respect to mobile-class solutions.
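The memory/latency/accuracy trade-off in the abstract hinges on where the "latent replay" boundary is placed: the layers below it are frozen, and the replay buffer stores their (compact) latent activations rather than raw images, so only the layers above the boundary are retrained. The following is a minimal NumPy sketch of that idea, not the paper's actual implementation; the dimensions, the stand-in frozen frontend, and the single linear head trained with softmax cross-entropy are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not the paper's network).
FEAT_DIM, NUM_CLASSES = 32, 10

def frozen_frontend(x):
    """Stand-in for the frozen lower part of the network.
    In Latent Replay, nothing below this point is ever retrained."""
    W = np.full((x.shape[-1], FEAT_DIM), 0.01)  # fixed, never updated
    return np.maximum(x @ W, 0.0)               # ReLU outputs = "latents"

# Replay buffer holds latent activations (cheap) instead of raw inputs.
replay_latents, replay_labels = [], []

# The trainable head: the only parameters updated during on-device CL.
W_head = rng.normal(0.0, 0.1, (FEAT_DIM, NUM_CLASSES))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def train_step(x_new, y_new, lr=0.05):
    """One SGD step on a mix of newly sensed data and replayed latents."""
    global W_head
    z_new = frozen_frontend(x_new)          # forward pass through frozen part
    replay_latents.append(z_new)            # store latents for future replay
    replay_labels.append(y_new)
    z = np.concatenate(replay_latents)      # new latents + replayed latents
    y = np.concatenate(replay_labels)
    probs = softmax(z @ W_head)
    onehot = np.eye(NUM_CLASSES)[y]
    grad = z.T @ (probs - onehot) / len(y)  # softmax cross-entropy gradient
    W_head -= lr * grad                     # update only the head
    return -np.log(probs[np.arange(len(y)), y]).mean()

# Usage: stream two small batches of "new class" data.
loss1 = train_step(rng.normal(size=(8, 16)), rng.integers(0, NUM_CLASSES, 8))
loss2 = train_step(rng.normal(size=(8, 16)), rng.integers(0, NUM_CLASSES, 8))
```

Moving the replay boundary down (retraining more layers) raises accuracy but inflates both the latent buffer and the backward-pass cost, which is exactly the 20 MB / 300 MB and 867 ms / 1.5 h / 5 h spread the abstract reports.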
