Paper Title
Contrastive Adapters for Foundation Model Group Robustness
Paper Authors
Abstract
While large pretrained foundation models (FMs) have shown remarkable zero-shot classification robustness to dataset-level distribution shifts, their robustness to subpopulation or group shifts is relatively underexplored. We study this problem, and find that FMs such as CLIP may not be robust to various group shifts. Across 9 robustness benchmarks, zero-shot classification with their embeddings results in gaps of up to 80.7 percentage points (pp) between average and worst-group accuracy. Unfortunately, existing methods to improve robustness require retraining, which can be prohibitively expensive on large foundation models. We also find that efficient ways to improve model inference (e.g., via adapters, lightweight networks with FM embeddings as inputs) do not consistently improve and can sometimes hurt group robustness compared to zero-shot (e.g., increasing the accuracy gap by 50.1 pp on CelebA). We thus develop an adapter training strategy to effectively and efficiently improve FM group robustness. Our motivating observation is that while poor robustness results from groups in the same class being embedded far apart in the foundation model "embedding space," standard adapter training may not bring these points closer together. We thus propose contrastive adapting, which trains adapters with contrastive learning to bring sample embeddings close to both their ground-truth class embeddings and other sample embeddings in the same class. Across the 9 benchmarks, our approach consistently improves group robustness, raising worst-group accuracy by 8.5 to 56.0 pp over zero-shot. Our approach is also efficient, doing so without any FM finetuning and only a fixed set of frozen FM embeddings. On benchmarks such as Waterbirds and CelebA, this leads to worst-group accuracy comparable to state-of-the-art methods that retrain entire models, while only training $\leq$1% of the model parameters.
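To make the training objective concrete, here is a minimal, dependency-free sketch of a supervised contrastive loss of the kind the abstract describes: each adapted sample embedding is pulled toward its ground-truth class embedding and toward other same-class sample embeddings, and pushed away from the rest. The function name, the cosine-similarity scoring, and the temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors (assumed nonzero).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_adapter_loss(sample_embs, labels, class_embs, temperature=0.1):
    """Illustrative supervised contrastive loss over frozen FM embeddings.

    Positives for sample i: its class embedding class_embs[labels[i]] and
    every other sample with the same label. Negatives: everything else.
    Labels are assumed to index into class_embs.
    """
    total = 0.0
    n = len(sample_embs)
    for i in range(n):
        # Candidate pairs: all class embeddings, plus all other samples.
        cands = [(cosine(sample_embs[i], c), y == labels[i])
                 for y, c in enumerate(class_embs)]
        cands += [(cosine(sample_embs[i], sample_embs[j]), labels[j] == labels[i])
                  for j in range(n) if j != i]
        exps = [math.exp(sim / temperature) for sim, _ in cands]
        pos = sum(e for e, (_, is_pos) in zip(exps, cands) if is_pos)
        # -log of the probability mass assigned to positives.
        total += -math.log(pos / sum(exps))
    return total / n
```

As a sanity check, embeddings already clustered by class (e.g., class-0 samples near the class-0 embedding) should incur a lower loss than embeddings whose classes are swapped, which is the gap an adapter trained on this loss would be closing.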