Paper Title


Separable Self-attention for Mobile Vision Transformers

Paper Authors

Sachin Mehta, Mohammad Rastegari

Paper Abstract


Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency as compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$. A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running $3.2\times$ faster on a mobile device. Our source code is available at: \url{https://github.com/apple/ml-cvnets}
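
To make the mechanism concrete, below is a minimal PyTorch sketch of a separable self-attention block in the spirit of the abstract: a single latent projection produces per-token context scores, the score-weighted sum of keys forms a context vector, and that vector is broadcast back to all tokens with element-wise multiplication. The class name, the fused projection, and the exact branch layout here are illustrative assumptions for this sketch and do not reproduce the official ml-cvnets implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttention(nn.Module):
    # Sketch only: O(k) attention that replaces the k x k attention map
    # with per-token context scores and element-wise broadcasting.
    def __init__(self, embed_dim: int):
        super().__init__()
        self.embed_dim = embed_dim
        # One projection yields 1 context-score channel plus d key and d value channels.
        self.qkv_proj = nn.Linear(embed_dim, 1 + 2 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k tokens, d channels)
        scores, key, value = torch.split(
            self.qkv_proj(x), [1, self.embed_dim, self.embed_dim], dim=-1
        )
        # Context scores: softmax over the token axis, shape (batch, k, 1).
        context_scores = F.softmax(scores, dim=1)
        # Context vector: score-weighted sum of keys, shape (batch, 1, d).
        context_vector = (key * context_scores).sum(dim=1, keepdim=True)
        # Broadcast the global context to every token with element-wise ops.
        out = F.relu(value) * context_vector
        return self.out_proj(out)

x = torch.randn(2, 256, 64)          # 256 tokens, 64-dimensional embeddings
y = SeparableSelfAttention(64)(x)    # output shape matches the input: (2, 256, 64)

Because the only token-mixing step is the score-weighted sum, the cost grows linearly in the number of tokens k, and every other operation is per-token or element-wise, which is the property the abstract highlights for resource-constrained devices.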
