Paper Title
Task Bias in Vision-Language Models
Paper Authors
Paper Abstract
Incidental supervision from language has become a popular approach for learning generic visual representations that can be prompted to perform many recognition tasks in computer vision. We conduct an in-depth exploration of the CLIP model and show that its visual representation is often strongly biased towards solving some tasks more than others. Moreover, which task the representation will be biased towards is unpredictable, with little consistency across images. To resolve this task bias, we show how to learn a visual prompt that guides the representation towards features relevant to the task of interest. Our results show that these visual prompts can be independent of the input image and still effectively provide a conditioning mechanism to steer visual representations towards the desired task.
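The conditioning mechanism described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: a single learnable pixel perturbation (the visual prompt) is shared across all images and added to each input before a frozen encoder, and only the prompt and a small task head are trained. A tiny convolutional network stands in for CLIP's image encoder, and all names, shapes, and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Placeholder for a frozen vision backbone such as CLIP's image encoder
    (illustrative stand-in, not the actual CLIP architecture)."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, dim),
        )
        for p in self.parameters():
            p.requires_grad_(False)  # the backbone stays frozen

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
encoder = FrozenEncoder()

# The visual prompt: one learnable perturbation, independent of the input image.
prompt = nn.Parameter(torch.zeros(1, 3, 32, 32))
head = nn.Linear(32, 4)  # head for the task of interest (hypothetical 4-way task)
opt = torch.optim.Adam([prompt, *head.parameters()], lr=1e-2)

# Toy data standing in for a labeled dataset of the target task.
images = torch.randn(16, 3, 32, 32)
labels = torch.randint(0, 4, (16,))

for _ in range(20):
    feats = encoder(images + prompt)  # the prompt steers the representation
    loss = nn.functional.cross_entropy(head(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the prompt is broadcast onto every image, it conditions the frozen representation on the task without modifying the backbone or depending on the input image.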