Paper Title
Prior Knowledge-Guided Attention in Self-Supervised Vision Transformers
Paper Authors
Paper Abstract
Recent trends in self-supervised representation learning have focused on removing inductive biases from training pipelines. However, inductive biases can be useful in settings where limited data are available, or can provide additional insight into the underlying data distribution. We present spatial prior attention (SPAN), a framework that takes advantage of consistent spatial and semantic structure in unlabeled image datasets to guide Vision Transformer attention. SPAN operates by regularizing attention masks from separate transformer heads to follow various priors over semantic regions. These priors can be derived from data statistics or a single labeled sample provided by a domain expert. We study SPAN through several detailed real-world scenarios, including medical image analysis and visual quality assurance. We find that the resulting attention masks are more interpretable than those derived from domain-agnostic pretraining. SPAN produces a 58.7 mAP improvement for lung and heart segmentation. We also find that our method yields a 2.2 mAUC improvement compared to domain-agnostic pretraining when transferring the pretrained model to a downstream chest disease classification task. Lastly, we show that SPAN pretraining leads to higher downstream classification performance in low-data regimes compared to domain-agnostic pretraining.
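To make the core mechanism concrete, below is a minimal PyTorch sketch of the idea the abstract describes: penalizing the deviation of per-head attention maps from fixed spatial priors over semantic regions. All names (`spatial_prior_loss`, `prior_masks`), the shapes, the MSE penalty, and the weighting coefficient are hypothetical illustrations chosen for this sketch, not the authors' actual implementation.

```python
# Hypothetical sketch: regularizing ViT attention toward spatial priors.
# Not the authors' code; shapes, loss choice, and names are assumptions.
import torch
import torch.nn.functional as F


def spatial_prior_loss(attn: torch.Tensor, prior_masks: torch.Tensor) -> torch.Tensor:
    """Penalize per-head attention maps that deviate from spatial priors.

    attn:        (batch, heads, num_patches) -- attention weights from the
                 [CLS] token to each image patch, one map per head.
    prior_masks: (heads, num_patches) -- one prior per head over semantic
                 regions (e.g., a lung mask and a heart mask), normalized
                 to sum to 1 so it is comparable to a softmaxed attention row.
    """
    # Broadcast the priors over the batch and measure the per-head mismatch.
    target = prior_masks.unsqueeze(0).expand_as(attn)
    return F.mse_loss(attn, target)


# Usage: add the penalty to whatever self-supervised objective is in use.
batch, heads, num_patches = 8, 6, 196
attn = torch.softmax(torch.randn(batch, heads, num_patches), dim=-1)
priors = torch.rand(heads, num_patches)
priors = priors / priors.sum(dim=-1, keepdim=True)

ssl_loss = torch.tensor(0.0)  # placeholder for the base pretraining loss
total_loss = ssl_loss + 0.1 * spatial_prior_loss(attn, priors)
```

In this sketch each head is steered toward its own prior, which matches the abstract's statement that separate transformer heads are regularized to follow different priors over semantic regions.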