Paper Title
Exploring Relations in Untrimmed Videos for Self-Supervised Learning
Paper Authors
Paper Abstract
Existing video self-supervised learning methods mainly rely on trimmed videos for model training. However, trimmed datasets are manually annotated from untrimmed videos; in this sense, these methods are not truly self-supervised. In this paper, we propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV), which can be applied directly to untrimmed (truly unlabeled) videos to learn spatio-temporal features. ERUV first generates single-shot videos by shot change detection. A designed sampling strategy is then used to model relations between video clips, and the sampling strategy itself serves as the self-supervision signal. Finally, the network learns representations by predicting the category of the relation between video clips. ERUV is able to compare the differences and similarities of videos, which is also an essential procedure in action- and video-related tasks. We validate the learned models on action recognition and video retrieval tasks with three kinds of 3D CNNs. Experimental results show that ERUV learns richer representations and outperforms state-of-the-art self-supervised methods by significant margins.
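The first step of the ERUV pipeline, splitting untrimmed videos into single-shot videos via shot change detection, can be illustrated with a minimal histogram-difference sketch. The abstract does not specify which detector the authors use; the function name, bin count, and threshold below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def detect_shot_changes(frames, threshold=0.5):
    """Return indices where a shot boundary (cut) likely occurs.

    `frames` is a sequence of HxWx3 uint8 arrays. Consecutive frames
    are compared via normalized color histograms; a large L1 distance
    suggests an abrupt shot change. `threshold` is a hypothetical
    tuning parameter (L1 distance lies in [0, 2]).
    """
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=32, range=(0, 256))
        hist = hist / hist.sum()  # normalize to a probability vector
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(i)  # frame i starts a new shot
        prev_hist = hist
    return boundaries
```

Each detected boundary splits the untrimmed video into single-shot segments, from which clips can then be sampled for the relation-prediction pretext task.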