Paper Title
Full Contextual Attention for Multi-resolution Transformers in Semantic Segmentation
Paper Authors
Paper Abstract
Transformers have proved to be very effective for visual recognition tasks. In particular, vision transformers construct compressed global representations through self-attention and learnable class tokens. Multi-resolution transformers have shown recent success in semantic segmentation but can only capture local interactions in high-resolution feature maps. This paper extends the notion of global tokens to build GLobal Attention Multi-resolution (GLAM) transformers. GLAM is a generic module that can be integrated into most existing transformer backbones. GLAM includes learnable global tokens, which, unlike previous methods, model interactions between all image regions and extract powerful representations during training. Extensive experiments show that GLAM-Swin and GLAM-Swin-UNet perform substantially better than their vanilla counterparts on ADE20K and Cityscapes. Moreover, GLAM can be used to segment large 3D medical images, and GLAM-nnFormer achieves new state-of-the-art performance on the BCV dataset.
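To make the global-token idea from the abstract concrete, here is a minimal, hypothetical PyTorch sketch: learnable global tokens are concatenated to the patch tokens so that self-attention can propagate information between all image regions. The module name `GlobalTokenAttention`, the token count `num_global`, and all hyperparameters are illustrative assumptions, not the paper's actual GLAM implementation.

```python
import torch
import torch.nn as nn

class GlobalTokenAttention(nn.Module):
    """Sketch of the global-token mechanism described in the abstract:
    learnable global tokens attend jointly with patch tokens, letting
    information flow between all image regions in one attention step."""

    def __init__(self, dim: int = 96, num_heads: int = 4, num_global: int = 8):
        super().__init__()
        # Learnable global tokens, shared across the batch and trained end to end.
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global, dim))
        nn.init.trunc_normal_(self.global_tokens, std=0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) patch tokens from one backbone stage.
        b = x.shape[0]
        g = self.global_tokens.expand(b, -1, -1)
        tokens = torch.cat([g, x], dim=1)       # prepend global tokens
        h = self.norm(tokens)
        out, _ = self.attn(h, h, h)             # full attention incl. globals
        tokens = tokens + out                   # residual connection
        return tokens[:, g.shape[1]:]           # return updated patch tokens


# Usage: plug in after a backbone stage's feature map, flattened to tokens.
feats = torch.randn(2, 56 * 56, 96)             # e.g. a Swin stage output
print(GlobalTokenAttention()(feats).shape)      # torch.Size([2, 3136, 96])
```

Note that this sketch applies dense attention over all tokens, which is quadratic in the number of patches; a practical multi-resolution backbone would presumably combine the global tokens with windowed local attention, as the abstract's contrast with local-only interactions suggests.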