Title
Rethinking Generalization in Few-Shot Classification
Authors
Abstract
Single image-level annotations only correctly describe an often small subset of an image's content, particularly when complex real-world scenes are depicted. While this might be acceptable in many classification scenarios, it poses a significant challenge for applications where the set of classes differs significantly between training and test time. In this paper, we take a closer look at the implications in the context of $\textit{few-shot learning}$. Splitting the input samples into patches and encoding them with the help of Vision Transformers allows us to establish semantic correspondences between local regions across images, independent of their respective class. The most informative patch embeddings for the task at hand are then determined as a function of the support set via online optimization at inference time, additionally providing visual interpretability of `$\textit{what matters most}$' in the image. We build on recent advances in unsupervised training of networks via masked image modelling to overcome the lack of fine-grained labels and learn the more general statistical structure of the data, while avoiding negative image-level annotation influence, $\textit{aka}$ supervision collapse. Experimental results show the competitiveness of our approach, achieving new state-of-the-art results on four popular few-shot classification benchmarks for $5$-shot and $1$-shot scenarios.
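The two core steps the abstract describes — tokenizing an image into patches and ranking the resulting patch embeddings by their relevance to the support set — can be sketched in NumPy. This is a minimal illustrative sketch only: the random linear projection stands in for the actual ViT encoder, and the cosine-similarity ranking stands in for the paper's online optimization over the support set; both simplifications, and all function names, are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector -- mirroring ViT-style tokenization."""
    H, W, C = img.shape
    img = img[: H - H % p, : W - W % p]  # drop remainder rows/cols
    h, w = img.shape[0] // p, img.shape[1] // p
    return img.reshape(h, p, w, p, C).swapaxes(1, 2).reshape(h * w, p * p * C)

def rank_patches(query_emb, support_emb, k=3):
    """Score each query-patch embedding by its maximum cosine similarity
    to any support-set patch embedding; return indices of the top-k
    (a crude stand-in for the task-adaptive selection in the abstract)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    s = support_emb / np.linalg.norm(support_emb, axis=1, keepdims=True)
    sim = q @ s.T                # (num_query_patches, num_support_patches)
    scores = sim.max(axis=1)     # best support match per query patch
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))          # toy 32x32 "image"
proj = rng.standard_normal((16 * 16 * 3, 64))   # stand-in "encoder"
q_emb = patchify(img, 16) @ proj                # 4 patch tokens, 64-d each
s_emb = rng.standard_normal((8, 64))            # toy support-set embeddings
top = rank_patches(q_emb, s_emb, k=2)
print(top)  # indices of the 2 most support-relevant patches
```

The returned indices could then be mapped back to their spatial locations in the image, which is the kind of patch-level attribution the abstract refers to as visualizing `what matters most'.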