Paper Title

Bridging the View Disparity Between Radar and Camera Features for Multi-modal Fusion 3D Object Detection

Paper Authors

Taohua Zhou, Yining Shi, Junjie Chen, Kun Jiang, Mengmeng Yang, Diange Yang

Paper Abstract

Environmental perception with the multi-modal fusion of radar and camera is crucial in autonomous driving for increasing accuracy, completeness, and robustness. This paper focuses on utilizing millimeter-wave (MMW) radar and camera sensor fusion for 3D object detection. A novel method is proposed that realizes feature-level fusion under the bird's-eye view (BEV) for a better feature representation. Firstly, radar points are augmented with temporal accumulation and sent to a spatial-temporal encoder for radar feature extraction. Meanwhile, multi-scale 2D image features that adapt to various spatial scales are obtained from the image backbone and neck. Then, the image features are transformed into the BEV with the designed view transformer. In addition, this work fuses the multi-modal features in two stages, referred to as point-fusion and ROI-fusion. Finally, a detection head regresses object categories and 3D locations. Experimental results demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on the most crucial detection metrics, mean average precision (mAP) and nuScenes detection score (NDS), on the challenging nuScenes dataset.
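
The abstract outlines a multi-stage pipeline: radar spatial-temporal encoding, camera-to-BEV view transformation, and two-stage point/ROI fusion. Below is a minimal, runnable PyTorch sketch of that data flow. Every module name, channel width, and tensor shape here is a hypothetical stand-in for illustration, not the authors' released implementation; in particular, the paper's learned view transformer and ROI-fusion stage are simplified to placeholder layers.

```python
# Minimal sketch of the fusion pipeline described in the abstract.
# All names, channel sizes, and shapes are hypothetical illustrations.
import torch
import torch.nn as nn

class RadarEncoder(nn.Module):
    """Hypothetical spatial-temporal encoder: consumes temporally
    accumulated radar points rasterized onto a BEV grid."""
    def __init__(self, in_ch=8, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU())

    def forward(self, radar_bev):
        return self.net(radar_bev)

class CameraToBEV(nn.Module):
    """Stand-in for image backbone + neck + view transformer:
    maps a perspective image to BEV features."""
    def __init__(self, out_ch=64, bev_size=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, stride=2, padding=1), nn.ReLU())
        self.bev_size = bev_size

    def forward(self, image):
        feat = self.backbone(image)
        # Placeholder "view transformer": resample perspective features
        # onto the BEV grid (the paper's transformer is learned).
        return nn.functional.interpolate(
            feat, size=(self.bev_size, self.bev_size),
            mode='bilinear', align_corners=False)

class TwoStageFusion(nn.Module):
    """Point-fusion (dense, element-wise over the shared BEV grid),
    followed by a conv block standing in for ROI-fusion refinement."""
    def __init__(self, ch=64):
        super().__init__()
        self.point_fuse = nn.Conv2d(2 * ch, ch, 1)        # stage 1
        self.roi_fuse = nn.Conv2d(ch, ch, 3, padding=1)   # stage 2 (stand-in)

    def forward(self, radar_bev, cam_bev):
        fused = self.point_fuse(torch.cat([radar_bev, cam_bev], dim=1))
        return self.roi_fuse(fused)

# Toy inputs: an 8-channel radar BEV grid and one RGB image.
radar_bev = torch.randn(1, 8, 128, 128)
image = torch.randn(1, 3, 256, 704)
fused = TwoStageFusion()(RadarEncoder()(radar_bev), CameraToBEV()(image))
print(fused.shape)  # torch.Size([1, 64, 128, 128]) -> detection head input
```

In a full model, the fused BEV map would feed a detection head that regresses object categories and 3D boxes; that head is omitted here since the abstract does not specify its design.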
