Paper Title
Egocentric Action Recognition by Video Attention and Temporal Context
Paper Authors
Paper Abstract
We present the submission of Samsung AI Centre Cambridge to the CVPR 2020 EPIC-Kitchens Action Recognition Challenge. In this challenge, action recognition is posed as the problem of simultaneously predicting a single `verb' and `noun' class label given an input trimmed video clip. That is, a `verb' and a `noun' together define a compositional `action' class. The challenging aspects of this real-life action recognition task include small, fast-moving objects, complex hand-object interactions, and occlusions. At the core of our submission is a recently proposed spatio-temporal video attention model, called `W3' (`What-Where-When') attention~\cite{perez2020knowing}. We further introduce a simple yet effective contextual learning mechanism to model `action' class scores directly from long-term temporal behaviour based on the `verb' and `noun' prediction scores. Our solution achieves strong performance on the challenge metrics without using object-specific reasoning or extra training data. In particular, our best solution with multimodal ensemble achieves the 2$^{nd}$ best position for `verb', and 3$^{rd}$ best for `noun' and `action' on the Seen Kitchens test set.
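To make the contextual learning idea concrete, below is a minimal, hypothetical sketch of a module that fuses a clip's `verb' and `noun' scores with scores from temporally neighbouring clips to produce `action' logits. The module name, the GRU-based context encoder, the hidden size, and the class counts are all illustrative assumptions, not the actual architecture of the submission.

```python
import torch
import torch.nn as nn

class ActionContextHead(nn.Module):
    """Illustrative sketch (not the actual submission): predict `action'
    logits from a clip's `verb'/`noun' scores plus long-term temporal
    context given by the scores of neighbouring clips."""

    def __init__(self, num_verbs=125, num_nouns=352, num_actions=2513, hidden=512):
        # Class counts follow common EPIC-Kitchens conventions; treat them as assumptions.
        super().__init__()
        # Encode the sequence of neighbouring clips' concatenated verb/noun scores.
        self.context_encoder = nn.GRU(
            input_size=num_verbs + num_nouns, hidden_size=hidden, batch_first=True
        )
        # Map the current clip's scores plus the temporal context to action logits.
        self.action_head = nn.Linear(num_verbs + num_nouns + hidden, num_actions)

    def forward(self, verb_scores, noun_scores, ctx_verb_scores, ctx_noun_scores):
        # verb_scores: (B, num_verbs); noun_scores: (B, num_nouns)
        # ctx_*_scores: (B, T, num_*) from T temporally neighbouring clips.
        context = torch.cat([ctx_verb_scores, ctx_noun_scores], dim=-1)
        _, h = self.context_encoder(context)           # h: (1, B, hidden)
        clip = torch.cat([verb_scores, noun_scores], dim=-1)
        fused = torch.cat([clip, h.squeeze(0)], dim=-1)
        return self.action_head(fused)                 # (B, num_actions) action logits
```

In such a design, the `action' prediction no longer relies solely on a fixed (verb, noun) composition of per-clip scores but can also exploit regularities in the long-term temporal order of activities in the video.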