Paper Title
On-Device Training Under 256KB Memory
Paper Authors
Paper Abstract
On-device training enables a model to adapt to new data collected from sensors by fine-tuning a pre-trained model. Users can benefit from customized AI models without having to transfer data to the cloud, protecting privacy. However, the training memory consumption is prohibitive for IoT devices, which have tiny memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to low bit precision and the lack of normalization; (2) the limited hardware resources do not allow full back-propagation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize 8-bit quantized training. To reduce the memory footprint, we propose Sparse Update, which skips the gradient computation of less important layers and sub-tensors. The algorithm innovations are implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offloads runtime auto-differentiation to compile time. Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash without auxiliary memory, using less than 1/1000 of the memory of PyTorch and TensorFlow while matching the accuracy on the tinyML application VWW. Our study enables IoT devices not only to perform inference but also to continuously adapt to new data for on-device lifelong learning. A video demo can be found here: https://youtu.be/0pUFZYdoMY8.
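To make the gradient-scale mismatch that Quantization-Aware Scaling addresses concrete, the toy NumPy sketch below (all names, shapes, and the exact correction factor are illustrative assumptions, not the paper's implementation) quantizes a weight tensor symmetrically as W ≈ s·W_q, shows that the quantized parameter's gradient-to-weight norm ratio is off by roughly a factor of s² relative to the float graph, and rescales the quantized gradient to restore it.

```python
import numpy as np

def quantize(w, num_bits=8):
    """Symmetric per-tensor quantization: w ~= s * w_q with int8 w_q."""
    qmax = 2 ** (num_bits - 1) - 1
    s = np.abs(w).max() / qmax
    w_q = np.round(w / s).astype(np.int8)
    return w_q, s

# Toy linear layer y = x @ W with loss = sum(y), so dL/dW = x^T @ ones.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
W = rng.normal(scale=0.05, size=(16, 8))

g_W = x.T @ np.ones((4, 8))                      # float32-graph gradient

W_q, s = quantize(W)
# In the quantized graph the trainable parameter is W_q = W / s, so by
# the chain rule its gradient is s * dL/dW: a different scale entirely.
g_Wq = s * g_W

ratio_fp = np.linalg.norm(g_W) / np.linalg.norm(W)
ratio_q = np.linalg.norm(g_Wq) / np.linalg.norm(W_q.astype(np.float64))
# The quantized ratio is deflated by ~s**2, which destabilizes SGD.
print(ratio_q / ratio_fp, s ** 2)                # approximately equal

# QAS-style calibration: rescale the quantized gradient by 1 / s**2 so
# its gradient-to-weight ratio matches the float graph again.
g_Wq_cal = g_Wq / s ** 2
ratio_cal = np.linalg.norm(g_Wq_cal) / np.linalg.norm(W_q.astype(np.float64))
print(ratio_cal / ratio_fp)                      # close to 1, up to int8 rounding
```

The norm is taken over a float copy of `W_q` because squaring int8 values directly would overflow; the small residual deviation from 1 comes only from rounding during quantization.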
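The Sparse Update idea of skipping less important layers can also be illustrated with a toy selection routine. The sketch below (the scores, costs, and greedy ratio heuristic are all hypothetical; the paper derives its update scheme from contribution analysis and search, not this greedy rule) picks which layers to update under a fixed gradient-memory budget and leaves the rest frozen.

```python
def select_layers_to_update(scores, costs, budget_kb):
    """Greedily pick layers with the best accuracy contribution per KB of
    gradient memory until the budget is exhausted; the rest stay frozen."""
    order = sorted(range(len(scores)),
                   key=lambda i: scores[i] / costs[i], reverse=True)
    chosen, used = [], 0.0
    for i in order:
        if used + costs[i] <= budget_kb:
            chosen.append(i)
            used += costs[i]
    return sorted(chosen), used

# Hypothetical per-layer accuracy contributions (%) and gradient memory (KB).
scores = [0.1, 0.4, 0.9, 1.3, 1.1, 0.2]
costs = [4, 16, 32, 64, 48, 8]
layers, used = select_layers_to_update(scores, costs, budget_kb=100)
print(layers, used)  # -> [0, 1, 2, 5] 60.0
```

A real training system would then prune the backward graph so gradients for the frozen layers are never materialized, which is where the actual memory saving comes from.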