Paper Title
Patches Are All You Need?
Paper Authors
Asher Trockman, J. Zico Kolter
Paper Abstract
Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet. Our code is available at https://github.com/locuslab/convmixer.
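
The abstract describes the ConvMixer design only at a high level: a patch embedding followed by repeated blocks that mix spatial and channel information separately, using only standard convolutions and keeping the same internal size and resolution throughout. Below is a minimal PyTorch sketch consistent with that description. The specific choices here (GELU activations, BatchNorm, a residual connection around the spatial-mixing step, and the hyperparameter values in the usage lines) are illustrative assumptions for this sketch, not details stated in the abstract; the authors' reference implementation is in the linked repository.

import torch
import torch.nn as nn

class Residual(nn.Module):
    """Adds a skip connection around an arbitrary module."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def ConvMixer(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    return nn.Sequential(
        # Patch embedding: a strided convolution maps each p x p patch of
        # the image to a single dim-channel feature, so spatial resolution
        # shrinks once here and then stays fixed for the rest of the network.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            # Spatial mixing: a depthwise convolution (groups=dim) mixes
            # information across locations within each channel separately.
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            # Channel mixing: a 1x1 (pointwise) convolution mixes across
            # channels at each spatial location.
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )

# Illustrative usage with assumed hyperparameters:
model = ConvMixer(dim=256, depth=8, kernel_size=9, patch_size=7, n_classes=1000)
logits = model(torch.randn(1, 3, 224, 224))  # -> shape (1, 1000)

The separation of a depthwise convolution (spatial mixing) from a pointwise convolution (channel mixing) is the convolutional analogue of the token-mixing and channel-mixing MLPs in MLP-Mixer, which is the sense in which the model is "similar in spirit" to ViT and MLP-Mixer while using only standard convolutions.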