Paper Title
RN-VID: A Feature Fusion Architecture for Video Object Detection
Paper Authors
Paper Abstract
Consecutive frames in a video are highly redundant. Therefore, running a single-frame detector on every frame without reusing any information is quite wasteful for the task of video object detection. It is with this idea in mind that we propose RN-VID (standing for RetinaNet-VIDeo), a novel approach to video object detection. Our contributions are twofold. First, we propose a new architecture that allows the use of information from nearby frames to enhance feature maps. Second, we propose a novel module to merge feature maps of the same dimensions using re-ordering of channels and 1 x 1 convolutions. We then demonstrate that RN-VID achieves better mean average precision (mAP) than corresponding single-frame detectors with little additional cost during inference.
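The fusion idea in the second contribution can be illustrated with a minimal sketch. This is not the authors' code: the shapes, the grouped channel ordering, and the random weights are assumptions for illustration. Feature maps from several frames are re-ordered so corresponding channels sit next to each other, and a 1 x 1 convolution (equivalently, a per-pixel linear map across channels) projects the stack back to the original depth:

```python
import numpy as np

def interleave_channels(feature_maps):
    """Re-order channels from several frames so that channel c of every
    frame is adjacent to its counterparts.

    feature_maps: list of arrays of shape (C, H, W), one per frame.
    Returns an array of shape (C * n_frames, H, W).
    """
    stacked = np.stack(feature_maps, axis=1)   # (C, n_frames, H, W)
    c, n, h, w = stacked.shape
    return stacked.reshape(c * n, h, w)        # groups: c0_f0, c0_f1, ..., c1_f0, ...

def conv1x1(x, weights):
    """A 1 x 1 convolution is a linear combination of input channels at
    each spatial position: x is (C_in, H, W), weights is (C_out, C_in)."""
    c_in, h, w = x.shape
    return (weights @ x.reshape(c_in, -1)).reshape(-1, h, w)

rng = np.random.default_rng(0)
frames = [rng.standard_normal((256, 16, 16)) for _ in range(3)]  # 3 nearby frames
merged = interleave_channels(frames)           # (768, 16, 16)
w = rng.standard_normal((256, 768)) / np.sqrt(768)  # hypothetical learned weights
fused = conv1x1(merged, w)                     # back to (256, 16, 16)
print(merged.shape, fused.shape)
```

Because the output depth matches a single frame's feature map, the fused map can be fed to the unchanged detection head, which is why the extra inference cost stays small.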