Paper Title

LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

Paper Authors

Yi-Lin Sung, Jaemin Cho, Mohit Bansal

Paper Abstract

Fine-tuning large pre-trained models on downstream tasks has been adopted in a variety of domains recently. However, it is costly to update the entire parameter set of large pre-trained models. Although recently proposed parameter-efficient transfer learning (PETL) techniques allow updating a small subset of parameters (e.g., only 2% of the parameters) inside a pre-trained backbone network for a new task, they only reduce the training memory requirement by up to 30%. This is because the gradient computation for the trainable parameters still requires backpropagation through the large pre-trained backbone model. To address this, we propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements by a more substantial amount. Unlike existing parameter-efficient methods that insert additional parameters inside backbone networks, we train a ladder side network, a small and separate network that takes intermediate activations from the backbone network as input via shortcut connections (called ladders) and makes predictions. LST has significantly lower memory requirements than previous methods, because it does not require backpropagation through the backbone network, but only through the side network and ladder connections. We evaluate our method with various models (T5 and CLIP-T5) on both NLP (GLUE) and vision-and-language (VQA, GQA, NLVR2, MSCOCO) tasks. LST saves 69% of the memory cost of fine-tuning the whole network, while other methods save only 26% at similar parameter usage (hence, 2.7x more memory savings). Moreover, LST achieves higher accuracy than Adapter and LoRA in a low-memory regime. To further show the advantage of this better memory efficiency, we also apply LST to larger T5 models, attaining better GLUE performance than full fine-tuning and other PETL methods. The favorable accuracy-efficiency trade-off also holds on VL tasks.
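To make the mechanism in the abstract concrete, below is a minimal, hedged PyTorch sketch of how a ladder side network could consume activations from a frozen backbone. It is not the authors' implementation: the class name LadderSideNetwork, the reduction_factor, and the sigmoid gating are illustrative assumptions. What it shows is the key memory property: the backbone runs under torch.no_grad(), so backpropagation only flows through the small side network and its ladder connections.

```python
# Minimal sketch of the ladder side-tuning idea (illustrative, not the paper's code).
import torch
import torch.nn as nn


class LadderSideNetwork(nn.Module):
    def __init__(self, backbone_dim: int, num_layers: int, reduction_factor: int = 8):
        super().__init__()
        side_dim = backbone_dim // reduction_factor
        # Ladder connections: project each backbone activation down to the side width.
        self.downsamplers = nn.ModuleList(
            [nn.Linear(backbone_dim, side_dim) for _ in range(num_layers)]
        )
        # Small trainable side blocks (stand-ins for reduced transformer layers).
        self.side_blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(side_dim, side_dim), nn.ReLU()) for _ in range(num_layers)]
        )
        # One learned gate per layer to mix the side state with the backbone input.
        self.gates = nn.Parameter(torch.zeros(num_layers))

    def forward(self, backbone_activations):
        # backbone_activations: list of [batch, seq_len, backbone_dim] tensors,
        # one per backbone layer, computed under torch.no_grad().
        h = self.side_blocks[0](self.downsamplers[0](backbone_activations[0]))
        for i in range(1, len(self.side_blocks)):
            gate = torch.sigmoid(self.gates[i])
            h = gate * h + (1 - gate) * self.downsamplers[i](backbone_activations[i])
            h = self.side_blocks[i](h)
        return h


# Usage sketch: only the side network (and a task head) is passed to the optimizer.
#   with torch.no_grad():
#       activations = frozen_backbone(inputs, output_hidden_states=True).hidden_states
#   logits = task_head(side_net(list(activations)))
```

Because gradients never enter the frozen backbone, its intermediate activations do not need to be stored for the backward pass, which is where the reported memory savings come from.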
