Paper Title

Continual Learning with Transformers for Image Classification

Authors

Beyza Ermis, Giovanni Zappella, Martin Wistuba, Aditya Rawal, Cedric Archambeau

Abstract

In many real-world scenarios, data to train machine learning models become available over time. However, neural network models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is known as catastrophic forgetting, and it is often difficult to prevent due to practical constraints, such as the amount of data that can be stored or the limited computational resources that can be used. Moreover, training large neural networks, such as Transformers, from scratch is very costly and requires a vast amount of training data, which might not be available in the application domain of interest. A recent trend indicates that dynamic architectures based on an expansion of the parameters can reduce catastrophic forgetting efficiently in continual learning, but these architectures require complex tuning to balance the growing number of parameters and barely share any information across tasks. As a result, they struggle to scale to a large number of tasks without significant overhead. In this paper, we validate in the computer vision domain a recent solution called Adaptive Distillation of Adapters (ADA), which was developed to perform continual learning using pre-trained Transformers and Adapters on text classification tasks. We empirically demonstrate on different classification tasks that this method maintains good predictive performance without retraining the model or increasing the number of model parameters over time. In addition, it is significantly faster at inference time compared to state-of-the-art methods.
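The abstract's key idea is that a frozen pre-trained Transformer is extended with small trainable adapter modules, so new tasks can be learned without retraining the backbone. A minimal, hypothetical sketch of such a bottleneck adapter (NumPy only, not the authors' ADA implementation; dimensions and initialization are illustrative assumptions):

```python
import numpy as np

class Adapter:
    """A bottleneck adapter: down-project, nonlinearity, up-project,
    plus a residual connection. In adapter-based continual learning,
    only these small matrices are trained per task while the
    pre-trained Transformer weights stay frozen."""

    def __init__(self, hidden_dim, bottleneck_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init so the adapter starts close to the identity map.
        self.w_down = rng.normal(0.0, 0.02, (hidden_dim, bottleneck_dim))
        self.w_up = rng.normal(0.0, 0.02, (bottleneck_dim, hidden_dim))

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        return x + h @ self.w_up              # residual preserves frozen features

# Usage: the adapter maps hidden states back to the same shape,
# so it can be inserted after a frozen Transformer sublayer.
adapter = Adapter(hidden_dim=768, bottleneck_dim=64)
x = np.ones((2, 768))  # a batch of token representations
out = adapter(x)
print(out.shape)  # (2, 768)
```

Because the input and output shapes match, the adapter slots into the backbone without architectural changes, and its parameter count (here 2 × 768 × 64) is a small fraction of a full Transformer layer's.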
