Paper Title

A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

Paper Authors

Kishaan Jeeveswaran, Senthilkumar Kathiresan, Arnav Varma, Omar Magdy, Bahram Zonooz, Elahe Arani

Paper Abstract

Convolutional Neural Networks (CNNs), architectures consisting of convolutional layers, have been the standard choice in vision tasks. Recent studies have shown that Vision Transformers (VTs), architectures based on self-attention modules, achieve comparable performance in challenging tasks such as object detection and semantic segmentation. However, the image processing mechanism of VTs is different from that of conventional CNNs. This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks. To address these questions, we study and compare VT and CNN architectures as feature extractors in object detection and semantic segmentation. Our extensive empirical results show that the features generated by VTs are more robust to distribution shifts, natural corruptions, and adversarial attacks in both tasks, whereas CNNs perform better at higher image resolutions in object detection. Furthermore, our results demonstrate that VTs in dense prediction tasks produce more reliable and less texture-biased predictions.
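
To make the comparison described in the abstract concrete, below is a minimal sketch (not the authors' code) of swapping a CNN backbone and a Vision Transformer backbone as feature extractors over the same input. The model names are standard timm identifiers chosen for illustration; the paper's exact architectures, checkpoints, and downstream detection/segmentation heads may differ.

# Minimal backbone-swap sketch using timm; checkpoint choices are assumptions.
import torch
import timm

# A representative CNN/VT pair; the paper's exact backbones may differ.
cnn_backbone = timm.create_model("resnet50", pretrained=True, num_classes=0)
vt_backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
cnn_backbone.eval()
vt_backbone.eval()

x = torch.randn(1, 3, 224, 224)  # dummy image batch

with torch.no_grad():
    # CNN yields a spatial feature map, e.g. (1, 2048, 7, 7) for ResNet-50.
    cnn_feats = cnn_backbone.forward_features(x)
    # ViT yields a token sequence, e.g. (1, 197, 768) for ViT-B/16.
    vt_feats = vt_backbone.forward_features(x)

print(cnn_feats.shape, vt_feats.shape)

In a study like this one, either backbone's features would then feed a detection or segmentation head, and the same evaluation (distribution shifts, natural corruptions, adversarial attacks, varying input resolutions) would be run on both to isolate the effect of the feature extractor.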
