Paper Title
Spatio-temporal Tubelet Feature Aggregation and Object Linking in Videos
Paper Authors
Paper Abstract
This paper addresses the problem of how to exploit the spatio-temporal information available in videos to improve object detection precision. We propose a two-stage object detector called FANet, based on short-term spatio-temporal feature aggregation to produce a first set of detections, and on long-term object linking to refine them. First, we generate a set of short tubelet proposals, each containing the object across $N$ consecutive frames. Then, we aggregate RoI-pooled deep features along each tubelet using a temporal pooling operator that summarizes the information into a fixed-size output, independent of the number of input frames. On top of that, we define a double-head implementation: one head is fed with the spatio-temporally aggregated information for spatio-temporal object classification, while the other is fed with spatial information extracted from the current frame for object localization and spatial classification. Furthermore, we specialize the architecture of each head branch to better perform its task given its input data. Finally, a long-term linking method builds long tubes from the previously computed short tubelets to overcome detection errors. We have evaluated our model on the widely used ImageNet VID dataset, achieving 80.9% mAP, a new state-of-the-art result for single models. Likewise, on the challenging small-object detection dataset USC-GRAD-STDdb, our proposal outperforms the single-frame baseline by 5.4% mAP.
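To make the short-term aggregation step concrete, below is a minimal PyTorch sketch, not FANet's actual implementation: it assumes element-wise max as the temporal pooling operator (the abstract only requires a fixed-size output independent of $N$), and the `DoubleHead` module, its layer sizes, and all names here are hypothetical illustrations of the two-branch design described above.

```python
import torch
import torch.nn as nn


def temporal_pool(tubelet_feats: torch.Tensor) -> torch.Tensor:
    """Aggregate per-frame RoI features along a tubelet.

    tubelet_feats: (N, C, H, W), one RoI-pooled map per frame.
    Returns (C, H, W): the output size is independent of the frame count N.
    """
    # Element-wise max over the temporal axis is one pooling operator
    # with the fixed-output-size property the abstract requires.
    return tubelet_feats.max(dim=0).values


class DoubleHead(nn.Module):
    """Illustrative double head: a spatio-temporal classification branch,
    and a spatial branch for localization plus per-frame classification."""

    def __init__(self, channels: int, num_classes: int, roi_size: int = 7):
        super().__init__()
        flat = channels * roi_size * roi_size
        self.st_cls = nn.Linear(flat, num_classes)  # spatio-temporal classification
        self.sp_cls = nn.Linear(flat, num_classes)  # spatial classification
        self.sp_loc = nn.Linear(flat, 4)            # box regression (current frame)

    def forward(self, st_feats: torch.Tensor, cur_feats: torch.Tensor):
        # st_feats:  (B, C, H, W) temporally aggregated tubelet features
        # cur_feats: (B, C, H, W) features from the current frame only
        st, cur = st_feats.flatten(1), cur_feats.flatten(1)
        return self.st_cls(st), self.sp_cls(cur), self.sp_loc(cur)


# Toy usage: a 5-frame tubelet with 256-channel 7x7 RoI features.
tubelet = torch.randn(5, 256, 7, 7)
aggregated = temporal_pool(tubelet)              # (256, 7, 7), regardless of N
head = DoubleHead(channels=256, num_classes=31)  # e.g. 30 VID classes + background
scores_st, scores_sp, boxes = head(aggregated.unsqueeze(0),
                                   tubelet[-1].unsqueeze(0))
```

Note the design point the sketch captures: classification draws on the whole tubelet, while localization uses only the current frame, so the box stays anchored to that frame even as temporal context sharpens the class decision.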