Paper Title
Learning Target-aware Representation for Visual Tracking via Informative Interactions
Paper Authors
Paper Abstract
We introduce a novel backbone architecture that improves the target-perception ability of feature representations for tracking. Specifically, we observe that de facto frameworks perform feature matching simply using the outputs of the backbone for target localization, so there is no direct feedback from the matching module to the backbone network, especially its shallow layers. More concretely, only the matching module can directly access the target information (in the reference frame), while the representation learning of the candidate frame is blind to the reference target. As a consequence, the accumulation of target-irrelevant interference in the shallow stages may degrade the feature quality of deeper layers. In this paper, we approach the problem from a different angle by conducting multiple branch-wise interactions inside a Siamese-like backbone network (InBN). At the core of InBN is a general interaction modeler (GIM) that injects prior knowledge of the reference image into different stages of the backbone network, leading to better target perception and stronger distractor resistance of the candidate feature representation at negligible computational cost. The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNNs and Transformers, as evidenced by our extensive experiments on multiple benchmarks. In particular, the CNN version (based on SiamCAR) improves the baseline by absolute SUC gains of 3.2/6.9 on LaSOT/TNL2K, respectively. The Transformer version obtains SUC scores of 65.7/52.0 on LaSOT/TNL2K, on par with the recent state of the art. Code and models will be released.
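The abstract does not specify the internals of the general interaction modeler (GIM); the following is a minimal NumPy sketch, assuming a cross-attention-style interaction in which candidate-frame features query reference-frame features and the attended target context is injected back via a residual connection at a given backbone stage. All function names, weight shapes, and the residual form are illustrative assumptions, not the authors' actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gim_interaction(candidate, reference, rng=None):
    """Hypothetical sketch of one GIM-style interaction: candidate
    tokens (N, C) attend to reference tokens (M, C) via cross-attention,
    and the aggregated target context is added back so that the
    candidate representation becomes target-aware at this stage."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, c = candidate.shape
    # Illustrative random projections (these would be learned weights).
    w_q = rng.standard_normal((c, c)) / np.sqrt(c)
    w_k = rng.standard_normal((c, c)) / np.sqrt(c)
    w_v = rng.standard_normal((c, c)) / np.sqrt(c)
    q = candidate @ w_q            # queries from the candidate frame
    k = reference @ w_k            # keys from the reference (target) frame
    v = reference @ w_v            # values carrying target information
    attn = softmax(q @ k.T / np.sqrt(c))  # (N, M) candidate -> reference
    return candidate + attn @ v           # residual injection of target prior

# Toy usage: 16 candidate tokens and 4 reference tokens with 8 channels.
cand = np.random.default_rng(1).standard_normal((16, 8))
ref = np.random.default_rng(2).standard_normal((4, 8))
out = gim_interaction(cand, ref)
print(out.shape)  # (16, 8)
```

In the paper's scheme such an interaction is applied at multiple backbone stages rather than once at the output, which is what distinguishes InBN from matching only on final backbone features.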