Paper Title
Learning Domain Invariant Prompt for Vision-Language Models
Paper Authors
Paper Abstract
Prompt learning is one of the most effective and trending ways to adapt powerful vision-language foundation models like CLIP to downstream datasets by tuning learnable prompt vectors with very few samples. However, although prompt learning achieves excellent performance on in-domain data, it still faces the major challenge of generalizing to unseen classes and domains. Some existing prompt learning methods tackle this issue by adaptively generating different prompts for different tokens or domains, but they neglect the ability of the learned prompts to generalize to unseen domains. In this paper, we propose a novel prompt learning paradigm, called MetaPrompt, that directly generates a \emph{domain-invariant} prompt capable of generalizing to unseen domains. Specifically, a dual-modality prompt tuning network is proposed to generate prompts for inputs from both the image and text modalities. With a novel asymmetric contrastive loss, the representations from the original pre-trained vision-language model act as supervision to enhance the generalization ability of the learned prompt. More importantly, we propose a meta-learning-based prompt tuning algorithm that explicitly constrains the task-specific prompt tuned on one domain or class to also achieve good performance on another domain or class. Extensive experiments on 11 datasets for base-to-new generalization and 4 datasets for domain generalization demonstrate that our method consistently and significantly outperforms existing methods.
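The abstract's central idea, constraining a prompt adapted on one domain or class to also perform well on another, can be read as a MAML-style bi-level update over the prompt vectors. Below is a minimal, self-contained sketch of such an episodic prompt tuning loop, assuming a toy stand-in for a frozen CLIP (the random projections `frozen_img_enc` and `class_text_feat`, the `logits_fn` scoring rule, and the `make_episode` sampler are all illustrative assumptions, not the paper's actual architecture or loss):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for frozen CLIP encoders: fixed random projections.
# The real method prompts a frozen pre-trained CLIP; this is a toy substitute.
DIM, N_CLASS = 32, 5
frozen_img_enc = torch.randn(64, DIM)        # maps raw 64-d inputs to 32-d features
class_text_feat = torch.randn(N_CLASS, DIM)  # frozen per-class text features

def logits_fn(x, prompt):
    """Prompted similarity scores: the prompt shifts the frozen text features."""
    img = F.normalize(x @ frozen_img_enc, dim=-1)
    txt = F.normalize(class_text_feat + prompt, dim=-1)
    return 100.0 * img @ txt.t()

def task_loss(prompt, batch):
    x, y = batch
    return F.cross_entropy(logits_fn(x, prompt), y)

def make_episode(n=16):
    """Toy 'domain': random inputs with random labels (illustration only)."""
    return torch.randn(n, 64), torch.randint(0, N_CLASS, (n,))

prompt = torch.zeros(N_CLASS, DIM, requires_grad=True)  # learnable prompt vectors
optimizer = torch.optim.SGD([prompt], lr=1e-2)
inner_lr = 0.1

for step in range(100):
    support, query = make_episode(), make_episode()
    # Inner step: adapt the prompt to one sampled domain/class split.
    (grad,) = torch.autograd.grad(task_loss(prompt, support), prompt,
                                  create_graph=True)
    adapted = prompt - inner_lr * grad
    # Outer step: the adapted prompt must also do well on a *different*
    # split -- the explicit domain-invariance constraint from the abstract.
    optimizer.zero_grad()
    task_loss(adapted, query).backward()
    optimizer.step()
```

The key design point is `create_graph=True`: the outer loss backpropagates through the inner adaptation step, so the meta-gradient pushes the shared prompt toward initializations whose single-domain adaptations transfer across domains, rather than toward any one domain's optimum.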