Title
Temporally Efficient Vision Transformer for Video Instance Segmentation
Authors
Abstract
Recently, vision transformers have achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Different from previous transformer-based VIS methods, TeViT is nearly convolution-free: it contains a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stage, we propose a parameter-shared spatiotemporal query interaction mechanism to build one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results and maintains high inference speed, e.g., 46.6 AP with 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.
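The "messenger shift" idea described in the abstract can be illustrated with a minimal sketch. This is a hypothetical, simplified illustration (analogous to temporal shift operations), not the paper's actual implementation: per-frame messenger tokens are exchanged across neighboring frames by shifting them along the temporal axis, which fuses temporal context with essentially no extra parameters. The function name `messenger_shift` and the half/half split of tokens are assumptions for this sketch.

```python
import numpy as np

def messenger_shift(messengers: np.ndarray) -> np.ndarray:
    """Shift messenger tokens across frames for early temporal fusion.

    messengers: array of shape (T, M, C) — T frames, M messenger tokens
    per frame, C channels. The first half of each frame's messengers is
    replaced by those of the previous frame, the second half by those of
    the next frame (circularly). No learned parameters are involved.
    """
    T, M, C = messengers.shape
    half = M // 2
    shifted = messengers.copy()
    # first half receives messengers from the previous frame (t-1 -> t)
    shifted[:, :half] = np.roll(messengers[:, :half], shift=1, axis=0)
    # second half receives messengers from the next frame (t+1 -> t)
    shifted[:, half:] = np.roll(messengers[:, half:], shift=-1, axis=0)
    return shifted
```

After the shift, each frame's token set mixes information from adjacent frames, so the subsequent (spatial) transformer attention in that stage can propagate temporal context without any temporal attention layers.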