Paper Title

An Extendable, Efficient and Effective Transformer-based Object Detector

Paper Authors

Hwanjun Song, Deqing Sun, Sanghyuk Chun, Varun Jampani, Dongyoon Han, Byeongho Heo, Wonjae Kim, Ming-Hsuan Yang

Paper Abstract

Transformers have been widely used in numerous vision problems, especially for visual recognition and detection. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architectures for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer into a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential for boosting detection performance without a significant increase in computational load. In addition, we extend it to ViDT+ to support joint-task learning for object detection and instance segmentation. Specifically, we attach an efficient multi-scale feature fusion layer and utilize two additional auxiliary training losses, an IoU-aware loss and a token labeling loss. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and its extension ViDT+ achieves 53.2 AP owing to its high scalability for large models. The source code and trained models are available at https://github.com/naver-ai/vidt.
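To make the pipeline described in the abstract more concrete, below is a minimal, self-contained PyTorch sketch, not the authors' implementation: learnable [DET] tokens are carried through a toy hierarchical backbone alongside patch tokens (a stand-in for the reconfigured attention module of the real Swin backbone), and an encoder-free decoder then cross-attends from the [DET] tokens to concatenated multi-scale features to predict classes and boxes. Standard multi-head attention is used here in place of the deformable attention in the paper, and all module names, token counts, and dimensions (ToyViDT, ToyBackboneWithDetTokens, dim=256, 100 detection tokens) are illustrative assumptions.

```python
# Simplified sketch of a ViDT-style detector; see https://github.com/naver-ai/vidt
# for the actual implementation.
import torch
import torch.nn as nn


class ToyBackboneWithDetTokens(nn.Module):
    """Stand-in backbone: produces multi-scale patch features plus [DET] tokens."""

    def __init__(self, dim=256, num_det_tokens=100):
        super().__init__()
        # Learnable detection queries appended to the backbone input.
        self.det_tokens = nn.Parameter(torch.randn(1, num_det_tokens, dim))
        # Two conv stages emulate a hierarchical (Swin-like) backbone with two scales.
        self.stage1 = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        self.stage2 = nn.Conv2d(dim, dim, kernel_size=2, stride=2)

    def forward(self, images):
        b = images.size(0)
        f1 = self.stage1(images)            # (B, dim, H/8,  W/8)
        f2 = self.stage2(f1)                # (B, dim, H/16, W/16)
        multi_scale = [f.flatten(2).transpose(1, 2) for f in (f1, f2)]
        det = self.det_tokens.expand(b, -1, -1)   # (B, num_det_tokens, dim)
        return multi_scale, det


class EncoderFreeDecoderLayer(nn.Module):
    """[DET] tokens self-attend, then cross-attend to the multi-scale features."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, det, memory):
        det = self.norm1(det + self.self_attn(det, det, det)[0])
        det = self.norm2(det + self.cross_attn(det, memory, memory)[0])
        return self.norm3(det + self.ffn(det))


class ToyViDT(nn.Module):
    def __init__(self, dim=256, num_classes=91, num_layers=6):
        super().__init__()
        self.backbone = ToyBackboneWithDetTokens(dim)
        self.decoder = nn.ModuleList(EncoderFreeDecoderLayer(dim) for _ in range(num_layers))
        self.class_head = nn.Linear(dim, num_classes)
        self.box_head = nn.Linear(dim, 4)   # normalized (cx, cy, w, h)

    def forward(self, images):
        multi_scale, det = self.backbone(images)
        # Concatenating scales is a simplification of the paper's multi-scale handling.
        memory = torch.cat(multi_scale, dim=1)
        for layer in self.decoder:
            det = layer(det, memory)
        return self.class_head(det), self.box_head(det).sigmoid()


if __name__ == "__main__":
    model = ToyViDT()
    logits, boxes = model(torch.randn(2, 3, 256, 256))
    print(logits.shape, boxes.shape)   # torch.Size([2, 100, 91]) torch.Size([2, 100, 4])
```

The design point this sketch tries to reflect is the "neck-free" structure suggested by the abstract: there is no transformer encoder refining the dense feature maps, so decoding cost scales with the small number of [DET] tokens rather than with the full token grid, which is where the latency advantage comes from.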
