论文标题
ptseformer:渐进的时间空间增强型变压器朝着视频对象检测
PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection
论文作者
论文摘要
近年来,人们见证了应用上下文框架以提高对象检测作为视频对象检测的性能的趋势。现有方法通常在一次冲程中汇总特征以增强功能。但是,这些方法通常缺乏相邻帧的空间信息,并且特征聚集不足。为了解决这些问题,我们采取了一种渐进的方式来引入时间信息和空间信息以进行集成增强。时间信息由时间特征聚合模型(TFAM)引入,通过在上下文框架和目标框架之间进行注意机制(即要检测到的框架)。同时,我们采用空间过渡意识模型(StAM)来传达每个上下文框架和目标框架之间的位置过渡信息。我们的PTSeformer建立在基于变压器的检测器DETR上,还遵循端到端的方式,以避免重大的后处理过程,同时在Imagenet VID数据集上获得88.1%的地图。代码可在https://github.com/hon-wong/ptseformer上找到。
Recent years have witnessed a trend of applying context frames to boost the performance of object detection as video object detection. Existing methods usually aggregate features at one stroke to enhance the feature. These methods, however, usually lack spatial information from neighboring frames and suffer from insufficient feature aggregation. To address the issues, we perform a progressive way to introduce both temporal information and spatial information for an integrated enhancement. The temporal information is introduced by the temporal feature aggregation model (TFAM), by conducting an attention mechanism between the context frames and the target frame (i.e., the frame to be detected). Meanwhile, we employ a Spatial Transition Awareness Model (STAM) to convey the location transition information between each context frame and target frame. Built upon a transformer-based detector DETR, our PTSEFormer also follows an end-to-end fashion to avoid heavy post-processing procedures while achieving 88.1% mAP on the ImageNet VID dataset. Codes are available at https://github.com/Hon-Wong/PTSEFormer.