Paper Title
Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning
Paper Authors
Paper Abstract
Self-supervised learning has shown great potential for improving the video representation ability of deep neural networks by deriving supervision from the data itself. However, some current methods tend to cheat from the background, i.e., the prediction is highly dependent on the video background rather than the motion, making the model vulnerable to background changes. To mitigate the model's reliance on the background, we propose to remove the background impact by adding the background. That is, given a video, we randomly select a static frame and add it to every other frame to construct a distracting video sample. Then we force the model to pull the features of the distracting video and the original video closer, so that the model is explicitly restricted to resist the background influence and focus more on the motion changes. We term our method \emph{Background Erasing} (BE). Notably, our method is simple and neat to implement and can be added to most SOTA methods with little effort. Specifically, BE brings 16.4% and 19.1% improvements with MoCo on the severely biased datasets UCF101 and HMDB51, and a 14.5% improvement on the less biased dataset Diving48.
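To make the augmentation concrete, below is a minimal PyTorch sketch of the frame-mixing operation the abstract describes. The function names, the mixing weight `alpha`, the linear blending form, and the cosine consistency loss are illustrative assumptions: the abstract says only that a static frame is "added" to every other frame and that the two features are pulled closer, without fixing the exact formulation.

```python
import torch
import torch.nn.functional as F

def background_erasing(video: torch.Tensor, alpha: float = 0.3) -> torch.Tensor:
    """Build a distracting clip by mixing one randomly chosen static frame
    into every frame, per the abstract; `alpha` is an assumed mixing weight.

    video: (T, C, H, W) tensor with values in [0, 1].
    """
    t = video.shape[0]
    idx = torch.randint(0, t, (1,)).item()   # randomly pick the static frame
    static_frame = video[idx].unsqueeze(0)   # (1, C, H, W), broadcasts over T
    return ((1 - alpha) * video + alpha * static_frame).clamp(0, 1)

def be_consistency_loss(feat_orig: torch.Tensor, feat_dist: torch.Tensor) -> torch.Tensor:
    """Pull the features of the original and distracted clips together.

    A cosine-similarity sketch; the paper plugs BE into contrastive
    frameworks such as MoCo, whose exact objective is not reproduced here.
    """
    feat_orig = F.normalize(feat_orig, dim=-1)
    feat_dist = F.normalize(feat_dist, dim=-1)
    return (1.0 - (feat_orig * feat_dist).sum(dim=-1)).mean()
```

In a contrastive pipeline, the distracted clip would simply be treated as an additional positive view of the original clip, which is what makes BE easy to bolt onto existing SOTA methods.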