Paper Title
Bag of Image Patch Embedding Behind the Success of Self-Supervised Learning
Paper Authors
Paper Abstract
Self-supervised learning (SSL) has recently achieved tremendous empirical advances in learning image representations. However, our understanding of the principles behind learning such representations is still limited. This work shows that joint-embedding SSL approaches primarily learn a representation of image patches, which reflects their co-occurrence. Such a connection to co-occurrence modeling can be established formally, and it supplements the prevailing invariance perspective. We empirically show that learning a representation for fixed-scale patches and aggregating the local patch representations into an image representation achieves results similar to, or even better than, those of the baseline methods. We denote this process as BagSSL. Even with 32x32 patch representations, BagSSL achieves 62% top-1 linear probing accuracy on ImageNet. On the other hand, with a multi-scale pretrained model, we show that the whole-image embedding is approximately the average of the local patch embeddings. While the SSL representation is relatively invariant at the global scale, we show that locality is preserved when we zoom into local patch-level representations. Further, we show that patch representation aggregation improves various SOTA baseline methods by a large margin. The patch representation is considerably easier to understand, and this work takes a step toward demystifying self-supervised representation learning.
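To make the aggregation step described in the abstract concrete, below is a minimal sketch of a BagSSL-style image embedding. It assumes `encoder` is some pretrained joint-embedding SSL backbone (for example, one trained on 32x32 crops); the function name, patch size, and final normalization are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of BagSSL-style patch aggregation (assumptions: `encoder` is any
# pretrained joint-embedding SSL backbone; names and patch size are illustrative).
import torch
import torch.nn.functional as F


def bag_of_patch_embedding(encoder, image, patch_size=32, stride=32):
    """Embed fixed-scale patches of `image` and average them into one image vector.

    image: (C, H, W) tensor; returns a (D,) embedding.
    """
    # Cut the image into a grid of non-overlapping fixed-scale patches.
    patches = image.unfold(1, patch_size, stride).unfold(2, patch_size, stride)
    c = image.shape[0]
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c, patch_size, patch_size)

    with torch.no_grad():
        patch_embeddings = encoder(patches)  # (num_patches, D)

    # The image representation is the (normalized) mean of the local patch embeddings.
    return F.normalize(patch_embeddings.mean(dim=0), dim=0)
```

In a linear-probing evaluation of this kind, a linear classifier would be fit on these averaged patch embeddings; per the abstract, such a bag-of-patches image representation matches or exceeds the corresponding whole-image baselines.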