Paper Title

Unifying Voxel-based Representation with Transformer for 3D Object Detection

Paper Authors

Yanwei Li, Yilun Chen, Xiaojuan Qi, Zeming Li, Jian Sun, Jiaya Jia

Paper Abstract

In this work, we present a unified framework for multi-modality 3D object detection, named UVTR. The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection. To this end, the modality-specific space is first designed to represent different inputs in the voxel feature space. Different from previous work, our approach preserves the voxel space without height compression to alleviate semantic ambiguity and enable spatial connections. To make full use of the inputs from different sensors, the cross-modality interaction is then proposed, including knowledge transfer and modality fusion. In this way, geometry-aware expressions in point clouds and context-rich features in images are well utilized for better performance and robustness. The transformer decoder is applied to efficiently sample features from the unified space with learnable positions, which facilitates object-level interactions. In general, UVTR presents an early attempt to represent different modalities in a unified framework. It surpasses previous work in single- or multi-modality entries. The proposed method achieves leading performance in the nuScenes test set for both object detection and the following object tracking task. Code is made publicly available at https://github.com/dvlab-research/UVTR.
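
To make the pipeline described in the abstract more concrete, below is a minimal PyTorch sketch of the two core ideas: keeping modality-specific features in a full 3D voxel space (without height compression) and fusing them, then letting object queries with learnable 3D positions sample that unified space before a transformer decoder produces object-level predictions. This is an illustrative sketch only, not the released UVTR implementation; all class and method names (e.g. UnifiedVoxelDetectorSketch) are hypothetical, and the fusion and sampling steps are deliberately simplified (element-wise sum and trilinear grid sampling instead of the paper's knowledge transfer and deformable attention).

```python
# A minimal, hypothetical sketch of the ideas in the abstract; not the
# official UVTR code (see https://github.com/dvlab-research/UVTR for that).
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedVoxelDetectorSketch(nn.Module):
    def __init__(self, channels=64, num_queries=100, num_classes=10,
                 grid=(32, 128, 128)):  # (Z, Y, X): full voxel grid, no height compression
        super().__init__()
        self.grid = grid
        # Object queries and their learnable 3D reference positions in [0, 1].
        self.query_embed = nn.Embedding(num_queries, channels)
        self.ref_points = nn.Embedding(num_queries, 3)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=channels, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.cls_head = nn.Linear(channels, num_classes)
        self.box_head = nn.Linear(channels, 7)  # x, y, z, w, l, h, yaw

    def fuse(self, voxel_lidar, voxel_image):
        # Modality fusion kept deliberately simple: element-wise sum of the
        # two modality-specific voxel volumes of shape (B, C, Z, Y, X).
        return voxel_lidar + voxel_image

    def sample(self, voxels, ref):
        # Trilinear sampling of the unified voxel space at the learnable
        # reference points; grid_sample expects coordinates in [-1, 1].
        B = voxels.shape[0]
        grid = ref[None, :, None, None, :].expand(B, -1, 1, 1, 3) * 2 - 1
        feats = F.grid_sample(voxels, grid, align_corners=False)  # (B, C, N, 1, 1)
        return feats.squeeze(-1).squeeze(-1).permute(0, 2, 1)     # (B, N, C)

    def forward(self, voxel_lidar, voxel_image):
        unified = self.fuse(voxel_lidar, voxel_image)
        B = unified.shape[0]
        ref = self.ref_points.weight.sigmoid()                    # (N, 3) in [0, 1]
        queries = self.query_embed.weight[None].expand(B, -1, -1)
        sampled = self.sample(unified, ref)
        # Sampled voxel features act as the decoder "memory"; queries interact
        # with them (and with each other) to produce object-level predictions.
        hs = self.decoder(queries + sampled, sampled)
        return self.cls_head(hs), self.box_head(hs)

# Usage with random tensors standing in for modality-specific voxel features.
if __name__ == "__main__":
    B, C, Z, Y, X = 2, 64, 32, 128, 128
    model = UnifiedVoxelDetectorSketch(channels=C, grid=(Z, Y, X))
    cls_logits, boxes = model(torch.randn(B, C, Z, Y, X),
                              torch.randn(B, C, Z, Y, X))
    print(cls_logits.shape, boxes.shape)  # (2, 100, 10) (2, 100, 7)
```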
