Paper Title

Video Representation Learning with Visual Tempo Consistency

Paper Authors

Ceyuan Yang, Yinghao Xu, Bo Dai, Bolei Zhou

Paper Abstract

Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by sampling the same instance at slow and fast frame rates respectively, we can obtain slow and fast video frames which share the same semantics but contain different visual tempos. Video representations learned from VTHCL achieve competitive performance under the self-supervision evaluation protocol for action recognition on UCF-101 (82.1\%) and HMDB-51 (49.2\%). Moreover, comprehensive experiments suggest that the learned representations generalize well to other downstream tasks, including action detection on AVA and action anticipation on Epic-Kitchen. Finally, we propose the Instance Correspondence Map (ICM) to visualize the shared semantics captured by contrastive learning.
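The core idea in the abstract, sampling the same video at a slow and a fast frame rate and maximizing mutual information between the two clips' representations, can be illustrated with a minimal sketch. This is not the authors' implementation: the frame-sampling helper and the NumPy-based InfoNCE-style contrastive loss below are simplified assumptions (the paper's VTHCL applies the contrastive objective hierarchically across network stages, which is omitted here).

```python
import numpy as np

def sample_clip(num_frames, clip_len, stride):
    """Sample `clip_len` frame indices at a given temporal stride.
    A small stride yields a 'fast' tempo clip, a large stride a 'slow' one,
    but both cover the same underlying instance (hypothetical helper)."""
    max_start = num_frames - (clip_len - 1) * stride
    start = np.random.randint(0, max(max_start, 1))
    return start + stride * np.arange(clip_len)

def info_nce(slow_emb, fast_emb, temperature=0.07):
    """Contrastive (InfoNCE-style) loss: the slow/fast clips of the same
    instance form the positive pair; other instances in the batch are
    negatives. Inputs are (batch, dim) embeddings."""
    slow = slow_emb / np.linalg.norm(slow_emb, axis=1, keepdims=True)
    fast = fast_emb / np.linalg.norm(fast_emb, axis=1, keepdims=True)
    logits = slow @ fast.T / temperature          # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal
```

In training, the slow and fast clips would each pass through a video encoder to produce `slow_emb` and `fast_emb`; minimizing this loss pulls representations of the same instance together across tempos while pushing apart different instances.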
