Paper Title
Video-based Human-Object Interaction Detection from Tubelet Tokens
Paper Authors
Paper Abstract
We present a novel vision Transformer, named TUTOR, which learns tubelet tokens, serving as highly-abstracted spatiotemporal representations, for video-based human-object interaction (V-HOI) detection. The tubelet tokens structurize videos by agglomerating and linking semantically-related patch tokens along the spatial and temporal domains, which brings two benefits: 1) Compactness: each tubelet token is learned via a selective attention mechanism that reduces redundant dependencies on other spatial tokens; 2) Expressiveness: each tubelet token aligns with a semantic instance, i.e., an object or a human, across frames, thanks to agglomeration and linking. The effectiveness and efficiency of TUTOR are verified by extensive experiments. Results show that our method outperforms existing works by large margins, with a relative mAP gain of $16.14\%$ on VidHOI, a 2-point gain on CAD-120, and a $4\times$ speedup.
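To make the agglomeration idea concrete, the following is a minimal, hypothetical sketch (not the authors' code): patch tokens whose features are highly similar are greedily merged into coarser "tubelet-like" tokens by averaging, reducing the token count while keeping one token per semantic group. The `threshold` parameter and the greedy first-member linkage rule are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch (not the paper's method): greedily agglomerate
# patch tokens into coarser tokens by cosine-similarity merging.
import numpy as np

def agglomerate(tokens: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Greedily merge patch tokens whose cosine similarity with a group's
    first member exceeds `threshold`; each output row is the mean of one
    merged group. `tokens` has shape (num_patches, dim)."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    groups = []  # each group is a list of row indices into `tokens`
    for i, v in enumerate(normed):
        for g in groups:
            # Simple linkage rule: compare against the group's first member.
            if float(v @ normed[g[0]]) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return np.stack([tokens[g].mean(axis=0) for g in groups])

# Toy example: 4 patch tokens forming two near-duplicate pairs
# collapse into 2 merged tokens.
patches = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]])
merged = agglomerate(patches)
print(merged.shape)  # → (2, 2)
```

In the actual TUTOR model the grouping is learned end-to-end with selective attention and the merged tokens are further linked across frames into tubelets; this sketch only illustrates the spatial-agglomeration step in isolation.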