Paper Title
MAR: Masked Autoencoders for Efficient Action Recognition
Paper Authors
Paper Abstract
Standard approaches for video recognition usually operate on full input videos, which is inefficient due to the spatio-temporal redundancy widely present in videos. Recent progress in masked video modelling, i.e., VideoMAE, has shown the ability of vanilla Vision Transformers (ViT) to complement spatio-temporal contexts given only limited visual content. Inspired by this, we propose Masked Action Recognition (MAR), which reduces redundant computation by discarding a proportion of patches and operating on only a part of each video. MAR contains two indispensable components: cell running masking and the bridging classifier. Specifically, to enable the ViT to easily perceive details beyond the visible patches, cell running masking is presented to preserve the spatio-temporal correlations in videos, ensuring that the patches at the same spatial location can be observed in turn for easy reconstruction. Additionally, we notice that, although the partially observed features can reconstruct semantically explicit invisible patches, they fail to achieve accurate classification. To address this, a bridging classifier is proposed to bridge the semantic gap between the ViT features encoded for reconstruction and the features specialized for classification. Our proposed MAR reduces the computational cost of ViT by 53%, and extensive experiments show that MAR consistently outperforms existing ViT models by a notable margin. In particular, we find that a ViT-Large trained with MAR outperforms a ViT-Huge trained with a standard training scheme by convincing margins on both the Kinetics-400 and Something-Something V2 datasets, while the computational overhead of ViT-Large is only 14.5% of that of ViT-Huge.
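The abstract does not spell out the exact masking pattern, so the snippet below is only a minimal sketch of the cell running masking idea under assumed settings: the function name cell_running_mask, the default 1x2 cell shape, and the frame-by-frame cycling order are illustrative choices, not the paper's precise design. It shows how a running mask can reveal the patches at each spatial location in alternating frames while discarding a fixed proportion of tokens, which is the property the abstract attributes to cell running masking.

```python
import numpy as np

def cell_running_mask(num_frames: int, height: int, width: int,
                      cell_h: int = 1, cell_w: int = 2) -> np.ndarray:
    """Illustrative running mask over a (T, H, W) grid of patch tokens.

    Within every cell_h x cell_w spatial cell, exactly one patch stays
    visible per frame, and the visible position cycles across frames, so
    every spatial location is observed in turn.
    Masking ratio = 1 - 1 / (cell_h * cell_w).
    """
    assert height % cell_h == 0 and width % cell_w == 0
    mask = np.ones((num_frames, height, width), dtype=bool)  # True = masked out
    cell_area = cell_h * cell_w
    for t in range(num_frames):
        k = t % cell_area                        # position inside the cell to reveal
        dy, dx = divmod(k, cell_w)
        mask[t, dy::cell_h, dx::cell_w] = False  # reveal one patch per cell
    return mask

if __name__ == "__main__":
    m = cell_running_mask(num_frames=8, height=14, width=14)
    print("masking ratio:", m.mean())  # 0.5 with the default 1x2 cell
```

In a MAR-style pipeline, the False entries of such a mask would select the visible patch tokens fed to the ViT encoder, while the masked positions are left for reconstruction; the 50% ratio used by the default cell shape here is an assumption chosen to be consistent with the roughly 53% reduction in ViT computation reported in the abstract.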