Paper Title

Dilated Neighborhood Attention Transformer

Paper Authors

Ali Hassani, Humphrey Shi

Paper Abstract

Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.6% box AP in COCO object detection, 1.4% mask AP in COCO instance segmentation, and 1.4% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.5 PQ) and ADE20K (49.4 PQ), and instance segmentation model on Cityscapes (45.1 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.1 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).
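The key mechanism in the abstract — Dilated Neighborhood Attention (DiNA) — restricts each query to its k nearest neighbors spaced `dilation` positions apart, so the window's span (and hence the receptive field when dilation grows across layers) expands without increasing the number of attended tokens. The following is a minimal single-head 1-D sketch of this idea, not the authors' implementation (which is a batched, multi-head 2-D CUDA kernel in the NATTEN library); the function name, the absence of learned Q/K/V projections, and the edge-handling details are illustrative assumptions.

```python
import numpy as np

def dina_1d(x, kernel_size=3, dilation=1):
    """Sketch of 1-D Dilated Neighborhood Attention (single head,
    no learned projections). Each token attends to `kernel_size`
    neighbors spaced `dilation` apart; with dilation=1 this reduces
    to plain Neighborhood Attention (NA).
    x: (L, D) array of token features."""
    L, D = x.shape
    assert L >= dilation * kernel_size, "sequence too short for this kernel/dilation"
    # dilated offsets centered on the query position
    offs = dilation * (np.arange(kernel_size) - kernel_size // 2)
    out = np.empty_like(x, dtype=float)
    for i in range(L):
        idx = i + offs
        # near the edges, shift the whole window back in-bounds in steps
        # of `dilation`, preserving the dilation pattern (assumed edge rule)
        if idx[0] < 0:
            idx = idx + dilation * int(np.ceil(-idx[0] / dilation))
        if idx[-1] > L - 1:
            idx = idx - dilation * int(np.ceil((idx[-1] - (L - 1)) / dilation))
        keys = x[idx]                         # (kernel_size, D) keys/values
        logits = keys @ x[i] / np.sqrt(D)     # scaled dot-product scores
        w = np.exp(logits - logits.max())
        w /= w.sum()                          # softmax over the neighborhood
        out[i] = w @ keys                     # weighted sum of neighbor values
    return out
```

With `kernel_size=3`, a stack of layers whose dilations grow as 1, 2, 4, … covers an exponentially growing receptive field while each layer's per-token cost stays fixed — the complementary local/sparse-global pairing the abstract describes.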
