Paper Title

Multiview Detection with Feature Perspective Transformation

Authors

Yunzhong Hou, Liang Zheng, Stephen Gould

Abstract

Incorporating multiple camera views for detection alleviates the impact of occlusions in crowded scenes. In a multiview system, we need to answer two important questions when dealing with ambiguities that arise from occlusions. First, how should we aggregate cues from the multiple views? Second, how should we aggregate unreliable 2D and 3D spatial information that has been tainted by occlusions? To address these questions, we propose a novel multiview detection system, MVDet. For multiview aggregation, existing methods combine anchor box features from the image plane, which potentially limits performance due to inaccurate anchor box shapes and sizes. In contrast, we take an anchor-free approach to aggregate multiview information by projecting feature maps onto the ground plane (bird's eye view). To resolve any remaining spatial ambiguity, we apply large kernel convolutions on the ground plane feature map and infer locations from detection peaks. Our entire model is end-to-end learnable and achieves 88.2% MODA on the standard Wildtrack dataset, outperforming the state-of-the-art by 14.1%. We also provide detailed analysis of MVDet on a newly introduced synthetic dataset, MultiviewX, which allows us to control the level of occlusion. Code and MultiviewX dataset are available at https://github.com/hou-yz/MVDet.
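The core aggregation idea in the abstract — projecting each camera's feature map onto the ground plane via a perspective (homography) transformation and then combining the views — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `warp_to_ground` and `aggregate_views` are hypothetical, nearest-neighbour sampling stands in for the differentiable sampling a CNN would use, and the large-kernel convolution stage is only noted in a comment.

```python
import numpy as np

def warp_to_ground(feat, H, out_shape):
    """Inverse-warp a (C, h, w) image-plane feature map onto the ground plane.

    H is a 3x3 homography mapping ground-plane coordinates (x, y, 1) to
    image coordinates. Nearest-neighbour sampling is used for simplicity
    (a real model would use differentiable bilinear sampling); ground cells
    that fall outside the camera view stay zero.
    """
    C, h, w = feat.shape
    gh, gw = out_shape
    out = np.zeros((C, gh, gw), dtype=feat.dtype)
    ys, xs = np.meshgrid(np.arange(gh), np.arange(gw), indexing="ij")
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(gh * gw)])  # homogeneous
    uvw = H @ pts
    u = np.rint(uvw[0] / uvw[2]).astype(int)  # image column
    v = np.rint(uvw[1] / uvw[2]).astype(int)  # image row
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out[:, ys.ravel()[valid], xs.ravel()[valid]] = feat[:, v[valid], u[valid]]
    return out

def aggregate_views(feats, homographies, out_shape):
    """Anchor-free aggregation: project every view's feature map onto the
    ground plane and concatenate along the channel axis. In MVDet, the
    resulting bird's-eye-view map is then processed by large-kernel
    convolutions, and detections are read off as peaks."""
    warped = [warp_to_ground(f, H, out_shape) for f, H in zip(feats, homographies)]
    return np.concatenate(warped, axis=0)

# Tiny example: one 1-channel 3x4 feature map, identity homography.
feat = np.arange(12, dtype=float).reshape(1, 3, 4)
ground = aggregate_views([feat], [np.eye(3)], (3, 4))
```

With an identity homography the ground-plane map reproduces the image-plane features; with a real camera homography, the same mechanism "flattens" each view onto a common bird's-eye grid so that cues from multiple cameras line up spatially before the convolution stage.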
