论文标题
通过令牌融合改进图像分类
Improved Image Classification with Token Fusion
论文作者
论文摘要
在本文中,我们提出了一种使用CNN和变压器结构融合以提高图像分类性能的方法。对于CNN,可以很好地提取有关图像上局部区域的信息,但是限制了全局信息的提取。另一方面,变压器在相对全球的提取方面具有优势,但缺点是因为它需要大量的内存来进行本地特征值提取。在图像的情况下,它通过CNN转换为特征映射,每个特征映射的像素都被认为是令牌。同时,将图像分为补丁区域,然后与将其视为令牌视为标记的变压器方法融合。对于令牌与两个不同特征的融合,我们提出了三种方法:(1)与平行结构的晚期融合,(2)早期令牌融合,(3)逐层中的令牌融合。在使用Imagenet 1K的实验中,提出的方法显示了最佳的分类性能。
In this paper, we propose a method using the fusion of CNN and transformer structure to improve image classification performance. In the case of CNN, information about a local area on an image can be extracted well, but there is a limit to the extraction of global information. On the other hand, the transformer has an advantage in relatively global extraction, but has a disadvantage in that it requires a lot of memory for local feature value extraction. In the case of an image, it is converted into a feature map through CNN, and each feature map's pixel is considered a token. At the same time, the image is divided into patch areas and then fused with the transformer method that views them as tokens. For the fusion of tokens with two different characteristics, we propose three methods: (1) late token fusion with parallel structure, (2) early token fusion, (3) token fusion in a layer by layer. In an experiment using ImageNet 1k, the proposed method shows the best classification performance.