Paper Title

Frequency Selective Augmentation for Video Representation Learning

Authors

Jinhyung Kim, Taeoh Kim, Minho Shim, Dongyoon Han, Dongyoon Wee, Junmo Kim

Abstract

Recent self-supervised video representation learning methods focus on maximizing the similarity between multiple augmented views from the same video and largely rely on the quality of generated views. However, most existing methods lack a mechanism to prevent representation learning from bias towards static information in the video. In this paper, we propose frequency augmentation (FreqAug), a spatio-temporal data augmentation method in the frequency domain for video representation learning. FreqAug stochastically removes specific frequency components from the video so that learned representation captures essential features more from the remaining information for various downstream tasks. Specifically, FreqAug pushes the model to focus more on dynamic features rather than static features in the video via dropping spatial or temporal low-frequency components. To verify the generality of the proposed method, we experiment with FreqAug on multiple self-supervised learning frameworks along with standard augmentations. Transferring the improved representation to five video action recognition and two temporal action localization downstream tasks shows consistent improvements over baselines.
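The abstract describes the mechanism at a high level: transform a clip to the frequency domain, stochastically remove spatial or temporal low-frequency components, and transform back so that the remaining (more dynamic) content drives representation learning. Below is a minimal sketch of that idea in PyTorch for filtering along a single axis; the function name, the FFT-based high-pass filtering, the `cutoff` threshold, and the (T, C, H, W) layout are illustrative assumptions, not the authors' released implementation.

```python
import torch

def freq_aug(video: torch.Tensor, p: float = 0.5,
             cutoff: float = 0.25, dim: int = 0) -> torch.Tensor:
    """Hypothetical sketch: stochastic low-frequency removal along one axis.

    video: float tensor shaped (T, C, H, W); dim=0 filters the temporal axis,
    dim=2 or dim=3 a spatial axis. With probability p, frequency components
    whose normalized frequency magnitude falls below `cutoff` are zeroed out,
    then the clip is transformed back to the pixel domain.
    """
    if torch.rand(1).item() > p:          # apply the augmentation stochastically
        return video
    spec = torch.fft.fft(video, dim=dim)  # to the frequency domain along `dim`
    freqs = torch.fft.fftfreq(video.shape[dim], device=video.device)
    keep = (freqs.abs() >= cutoff).to(spec.dtype)   # high-pass mask (drops DC too)
    shape = [1] * video.dim()
    shape[dim] = -1
    spec = spec * keep.view(shape)        # remove low-frequency components
    return torch.fft.ifft(spec, dim=dim).real       # back to the pixel domain

# Example (assumed usage): temporal high-pass on a 16-frame RGB clip.
clip = torch.randn(16, 3, 112, 112)
aug_clip = freq_aug(clip, p=0.5, cutoff=0.1, dim=0)
```

In this sketch, choosing dim=0 pushes the view toward motion (temporal dynamics), while dim=2 or dim=3 removes coarse spatial structure; the paper's actual filtering and parameterization may differ.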
