Paper Title
$DA^3$: Dynamic Additive Attention Adaption for Memory-Efficient On-Device Multi-Domain Learning
Paper Authors
Paper Abstract
A practical limitation of today's deep neural networks (DNNs) is their high degree of specialization to a single task or domain (e.g., one visual domain). This motivates researchers to develop algorithms that can adapt a DNN model to multiple domains sequentially while still performing well on past domains, a setting known as multi-domain learning. Almost all conventional methods focus only on improving accuracy with minimal parameter updates, while ignoring the high computing and memory cost during training, which makes it difficult to deploy multi-domain learning on increasingly widely used, resource-limited edge devices such as mobile phones, IoT devices, and embedded systems. We observe that the large memory used for activation storage is the bottleneck that largely limits training time and cost on edge devices. To reduce training memory usage while maintaining domain adaptation accuracy, we propose Dynamic Additive Attention Adaption ($DA^3$), a novel memory-efficient on-device multi-domain learning method. $DA^3$ learns a novel additive attention adaptor module for each domain while freezing the weights of the pre-trained backbone model. Unlike prior works, this module not only mitigates activation memory buffering to reduce memory usage during training, but also serves as a dynamic gating mechanism that reduces the computation cost for fast inference. We validate $DA^3$ on multiple datasets against state-of-the-art methods, showing significant improvements in both accuracy and training time. Moreover, we deployed $DA^3$ on the popular NVIDIA Jetson Nano edge GPU, where the measured experimental results show that our proposed $DA^3$ reduces on-device training memory consumption by 19-37x and training time by 2x in comparison to baseline methods (e.g., standard fine-tuning, Parallel and Series residual adapters, and Piggyback).
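To make the core idea more concrete, below is a minimal, hypothetical PyTorch sketch of the general pattern the abstract describes: the pre-trained backbone is frozen, and only a small per-domain attention branch is trained, with its gated output added to the backbone features. The module structure, layer sizes, and names (`AdditiveAttentionAdaptor`, `make_domain_adaptors`, `reduction`) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class AdditiveAttentionAdaptor(nn.Module):
    """Hypothetical per-domain adaptor: a lightweight channel-attention branch
    whose gated output is *added* to the frozen backbone feature map.
    Layer sizes and structure are illustrative, not taken from the paper."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze spatial dims
            nn.Conv2d(channels, channels // reduction, 1),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # expand back
            nn.Sigmoid(),                                   # per-channel gate in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Additive update: frozen feature + attention-modulated feature.
        # A near-zero gate effectively suppresses the extra path, which is
        # the spirit of the dynamic gating described in the abstract.
        return x + x * self.gate(x)


def make_domain_adaptors(backbone: nn.Module, stage_channels) -> nn.ModuleList:
    """Freeze the shared pre-trained backbone and create one adaptor per stage.
    Only the adaptors receive gradient updates for a new domain, so the large
    activation buffers needed for backbone weight gradients are avoided."""
    for p in backbone.parameters():
        p.requires_grad = False
    return nn.ModuleList(AdditiveAttentionAdaptor(c) for c in stage_channels)


if __name__ == "__main__":
    # Single-stage stand-in backbone; a real ResNet would get one adaptor per stage.
    backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1))
    adaptors = make_domain_adaptors(backbone, [64])
    feat = backbone(torch.randn(1, 3, 32, 32))
    out = adaptors[0](feat)
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Because the backbone parameters are frozen, only the small adaptor tensors need gradients, which is the intuition behind the reported reduction in activation storage during on-device training.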