Title
Temporally Efficient Vision Transformer for Video Instance Segmentation
Authors
Abstract
Recently, vision transformers have achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Different from previous transformer-based VIS methods, TeViT is nearly convolution-free: it contains a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stage, we propose a parameter-shared spatiotemporal query interaction mechanism to build one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results and maintains high inference speed, e.g., 46.6 AP with 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.
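The "messenger shift" idea described in the abstract can be illustrated with a minimal sketch. This is a hypothetical, simplified illustration (analogous to temporal shift operations), not the paper's actual implementation: per-frame messenger tokens are exchanged across neighboring frames by shifting them along the temporal axis, which fuses temporal context with essentially no extra parameters. The function name `messenger_shift` and the half/half split of tokens are assumptions for this sketch.

```python
import numpy as np

def messenger_shift(messengers: np.ndarray) -> np.ndarray:
    """Shift messenger tokens across frames for early temporal fusion.

    messengers: array of shape (T, M, C) — T frames, M messenger tokens
    per frame, C channels. The first half of each frame's messengers is
    replaced by those of the previous frame, the second half by those of
    the next frame (circularly). No learned parameters are involved.
    """
    T, M, C = messengers.shape
    half = M // 2
    shifted = messengers.copy()
    # first half receives messengers from the previous frame (t-1 -> t)
    shifted[:, :half] = np.roll(messengers[:, :half], shift=1, axis=0)
    # second half receives messengers from the next frame (t+1 -> t)
    shifted[:, half:] = np.roll(messengers[:, half:], shift=-1, axis=0)
    return shifted
```

After the shift, each frame's token set mixes information from adjacent frames, so the subsequent (spatial) transformer attention in that stage can propagate temporal context without any temporal attention layers.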