Paper Title

Self-supervised Temporal Discriminative Learning for Video Representation Learning

Paper Authors

Jinpeng Wang, Yiqi Lin, Andy J. Ma, Pong C. Yuen

Abstract

Temporal cues in videos provide important information for recognizing actions accurately. However, temporal-discriminative features can hardly be extracted without an annotated large-scale video action dataset for training. This paper proposes a novel Video-based Temporal-Discriminative Learning (VTDL) framework in a self-supervised manner. Without labelled data for network pre-training, a temporal triplet is generated for each anchor video by using segments of the same or different time intervals, so as to enhance the capacity for temporal feature representation. Measuring temporal information by the time derivative, Temporal Consistent Augmentation (TCA) is designed to ensure that the time derivative (of any order) of the augmented positive is invariant except for a scaling constant. Finally, temporal-discriminative features are learnt by minimizing the distance between each anchor and its augmented positive, while the distance between each anchor and its augmented negative, as well as other videos saved in the memory bank, is maximized to enrich the representation diversity. In the downstream action recognition task, the proposed method significantly outperforms existing related works. Surprisingly, when a small-scale video dataset (with only thousands of videos) is used for pre-training, the proposed self-supervised approach is better than fully-supervised methods on UCF101 and HMDB51. The code has been made publicly available at https://github.com/FingerRec/Self-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition.
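The training objective described above (pull each anchor toward its augmented positive, push it away from its augmented negative and from features in a memory bank) can be sketched as an InfoNCE-style contrastive loss. The sketch below is a minimal illustration under that assumption; the function names, the cosine-similarity choice, and the temperature value are hypothetical and not taken from the paper's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def vtdl_contrastive_loss(anchor, positive, negative, memory_bank, tau=0.07):
    """Hypothetical sketch of the objective described in the abstract:
    minimize the distance between the anchor and its augmented positive,
    while maximizing its distance to the augmented negative and to other
    videos' features stored in the memory bank (InfoNCE formulation;
    the paper's exact loss may differ)."""
    pos = math.exp(cosine(anchor, positive) / tau)
    negs = [math.exp(cosine(anchor, n) / tau) for n in [negative] + memory_bank]
    return -math.log(pos / (pos + sum(negs)))

# Toy 2-D features: an anchor aligned with its positive yields a small loss.
anchor, positive, negative = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
memory_bank = [[-1.0, 0.0]]
loss = vtdl_contrastive_loss(anchor, positive, negative, memory_bank)
```

In practice the features would be L2-normalized network embeddings and the memory bank a queue of past features, but the scoring and normalization structure of the loss is the same.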
