Paper Title
Action Localization through Continual Predictive Learning
Paper Authors
Paper Abstract
The problem of action recognition involves locating the action in the video, both over time and spatially in the image. The dominant current approaches use supervised learning to solve this problem and require large amounts of annotated training data, in the form of frame-level bounding box annotations around the region of interest. In this paper, we present a new approach based on continual learning that uses feature-level predictions for self-supervision. It does not require any training annotations in terms of frame-level bounding boxes. The approach is inspired by cognitive models of visual event perception that propose a prediction-based approach to event understanding. We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for future frames. The prediction errors are used to continuously learn the parameters of the model. This self-supervised framework is less complex than other approaches but is very effective in learning robust visual representations for both labeling and localization. It should be noted that the approach produces its outputs in a streaming fashion, requiring only a single pass through the video, making it amenable to real-time processing. We demonstrate this on three datasets (UCF Sports, JHMDB, and THUMOS'13) and show that the proposed approach outperforms weakly-supervised and unsupervised baselines and obtains competitive performance compared to fully supervised baselines. Finally, we show that the proposed framework can generalize to egocentric videos and obtain state-of-the-art results in unsupervised gaze prediction.
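To make the predictive-learning loop in the abstract concrete, below is a minimal sketch in PyTorch of a CNN encoder plus an LSTM stack that predicts the next frame's features, with the prediction error used to update the parameters online in a single streaming pass. The encoder layers, feature dimensions, MSE prediction error, and training schedule are illustrative assumptions, not the authors' exact architecture, attention mechanism, or loss.

```python
# Minimal sketch of the continual predictive-learning loop (assumed details,
# not the authors' exact model): encode each frame, predict the next frame's
# features with an LSTM stack, and use the prediction error as the
# self-supervision signal for online updates.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PredictiveModel(nn.Module):
    """CNN encoder + LSTM stack that predicts the next frame's features."""

    def __init__(self, feat_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        # Hypothetical lightweight encoder standing in for a pretrained CNN.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.predictor = nn.Linear(hidden_dim, feat_dim)

    def forward(self, frame, state=None):
        feat = self.encoder(frame)                      # (B, feat_dim)
        out, state = self.lstm(feat.unsqueeze(1), state)
        pred_next = self.predictor(out.squeeze(1))      # predicted next-frame features
        return pred_next, state


def streaming_pass(model, optimizer, frames):
    """Single pass over a video: predict, measure error, update continually."""
    state, prev_pred, errors = None, None, []
    for frame in frames:                                # frame: (B, 3, H, W)
        with torch.no_grad():                           # target features of the current frame
            target = model.encoder(frame)
        if prev_pred is not None:
            # The prediction error drives the online parameter update; in the
            # paper it also feeds the attention used for spatial localization,
            # which this globally pooled sketch omits.
            loss = F.mse_loss(prev_pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            errors.append(loss.item())
        prev_pred, state = model(frame, state)
        state = tuple(s.detach() for s in state)        # truncate BPTT for streaming use
    return errors


# Usage with hypothetical shapes: a 16-frame clip of 112x112 RGB frames.
model = PredictiveModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
frames = torch.randn(16, 1, 3, 112, 112)                # T x (B, C, H, W)
per_frame_errors = streaming_pass(model, optimizer, frames)
```

Because the update happens once per incoming frame and the LSTM state is detached between steps, the loop needs only a single pass over the video, matching the streaming property claimed in the abstract; spikes in the per-frame error can serve as a rough temporal-localization cue in this simplified setting.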