Paper Title

Rethinking Query-Key Pairwise Interactions in Vision Transformers

Paper Authors

Cheng Li, Yangxin Liu

Paper Abstract

Vision Transformers have achieved state-of-the-art performance on many visual tasks. Due to the quadratic computational and memory complexity of self-attention, recent works either apply attention only to low-resolution inputs or restrict the receptive field to a small local region. To overcome these limitations, we propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency gate to obtain attention weights, modeling local-global interactions in all stages. Key-only attention has linear computational and memory complexity with respect to input size. We use an alternating layout to hybridize convolution and attention layers, instead of the grafting suggested by previous works, so that all stages benefit from both spatial attention and convolution. We leverage these improvements to develop a new self-attention model family, LinGlos, which reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark and significantly outperforms baselines on downstream tasks, e.g., COCO object detection and ADE20K semantic segmentation.
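The abstract only outlines the mechanism, so the snippet below is a minimal PyTorch sketch of what key-only attention could look like, assuming a per-head linear saliency gate over the keys and a single global context broadcast back to every token. The module name `KeyOnlyAttention`, the linear gate, and the residual combination with the values are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn


class KeyOnlyAttention(nn.Module):
    """Minimal sketch: attention weights come from a saliency gate over
    keys alone (no query-key dot products), so cost is linear in tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.kv = nn.Linear(dim, dim * 2)        # key/value projections
        self.gate = nn.Linear(self.head_dim, 1)  # hypothetical linear saliency gate
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape                        # (batch, tokens, dim)
        k, v = self.kv(x).chunk(2, dim=-1)
        k = k.reshape(b, n, self.num_heads, self.head_dim)
        v = v.reshape(b, n, self.num_heads, self.head_dim)
        # Saliency per token from keys only: O(n) weights, no n x n matrix.
        attn = self.gate(k).softmax(dim=1)         # (b, n, heads, 1)
        ctx = (attn * v).sum(dim=1, keepdim=True)  # one global context per head
        out = (v + ctx).reshape(b, n, d)           # broadcast context to all tokens
        return self.proj(out)


# Shape check: a batch of 4 sequences of 196 tokens with embedding dim 256.
x = torch.randn(4, 196, 256)
print(KeyOnlyAttention(dim=256)(x).shape)  # torch.Size([4, 196, 256])
```

Because the gate scores each key independently and the global context is a single weighted sum, compute and memory grow linearly with the token count, in contrast to the n-by-n attention matrix of standard query-key self-attention.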
