Paper Title

Joint Representation of Temporal Image Sequences and Object Motion for Video Object Detection

Paper Authors

Junho Koh, Jaekyum Kim, Younji Shin, Byeongwon Lee, Seungji Yang, Jun Won Choi

Paper Abstract

In this paper, we propose a new video object detector (VoD) method referred to as temporal feature aggregation and motion-aware VoD (TM-VoD), which produces a joint representation of temporal image sequences and object motion. The proposed TM-VoD aggregates visual feature maps extracted by convolutional neural networks applying the temporal attention gating and spatial feature alignment. This temporal feature aggregation is performed in two stages in a hierarchical fashion. In the first stage, the visual feature maps are fused at a pixel level via a gated attention model. In the second stage, the proposed method aggregates the features after aligning the object features using temporal box offset calibration and weights them according to the cosine similarity measure. The proposed TM-VoD also finds the representation of the motion of objects in two successive steps. The pixel-level motion features are first computed based on the incremental changes between the adjacent visual feature maps. Then, box-level motion features are obtained from both the region of interest (RoI)-aligned pixel-level motion features and the sequential changes of the box coordinates. Finally, all these features are concatenated to produce a joint representation of the objects for VoD. The experiments conducted on the ImageNet VID dataset demonstrate that the proposed method outperforms existing VoD methods and achieves a performance comparable to that of state-of-the-art VoDs.
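To make two of the ideas above concrete, the sketch below illustrates, in PyTorch, (i) pixel-level fusion of per-frame feature maps with a gated attention and (ii) pixel-level motion features computed as incremental differences between adjacent feature maps. This is a minimal illustration under assumed details: the module and function names (`GatedPixelFusion`, `pixel_motion_features`), the 1x1-convolution gate, and the softmax normalization over frames are not taken from the paper.

```python
# Minimal, illustrative sketch of two components described in the abstract:
# (1) pixel-level fusion of per-frame feature maps via per-pixel gated attention, and
# (2) pixel-level motion features from incremental changes between adjacent frames.
# Layer sizes and the exact gating form are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class GatedPixelFusion(nn.Module):
    """Fuse a sequence of per-frame feature maps into one map via per-pixel gates."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv produces one scalar gate per pixel for each (reference, support) pair.
        self.gate = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) feature maps of T consecutive frames; the last frame
        # is treated as the reference (current) frame in this sketch.
        ref = feats[-1]
        gates = []
        for t in range(feats.shape[0]):
            pair = torch.cat([ref, feats[t]], dim=0).unsqueeze(0)  # (1, 2C, H, W)
            gates.append(self.gate(pair))                          # (1, 1, H, W)
        gates = torch.softmax(torch.cat(gates, dim=0), dim=0)      # (T, 1, H, W)
        return (gates * feats).sum(dim=0)                          # (C, H, W)


def pixel_motion_features(feats: torch.Tensor) -> torch.Tensor:
    """Frame-to-frame differences of adjacent feature maps, stacked along the
    channel axis, as a crude pixel-level motion cue."""
    # feats: (T, C, H, W) -> deltas: (T-1, C, H, W) -> ((T-1)*C, H, W)
    deltas = feats[1:] - feats[:-1]
    return deltas.reshape(-1, *feats.shape[2:])


if __name__ == "__main__":
    T, C, H, W = 4, 256, 38, 50
    feats = torch.randn(T, C, H, W)
    fused = GatedPixelFusion(C)(feats)
    motion = pixel_motion_features(feats)
    print(fused.shape, motion.shape)  # torch.Size([256, 38, 50]) torch.Size([768, 38, 50])
```

The second-stage box-level aggregation (temporal box offset calibration followed by cosine-similarity weighting of RoI features) and the concatenation with box-level motion features are omitted here for brevity.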
