Paper Title


My View is the Best View: Procedure Learning from Egocentric Videos

Authors

Siddhant Bansal, Chetan Arora, C. V. Jawahar

Abstract


Procedure learning involves identifying the key-steps and determining their logical order to perform a task. Existing approaches commonly use third-person videos for learning the procedure, making the manipulated object small in appearance and often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action. However, procedure learning from egocentric videos is challenging because (a) the camera view undergoes extreme changes due to the wearer's head motion, and (b) the presence of unrelated frames due to the unconstrained nature of the videos. Due to this, current state-of-the-art methods' assumptions that the actions occur at approximately the same time and are of the same duration, do not hold. Instead, we propose to use the signal provided by the temporal correspondences between key-steps across videos. To this end, we present a novel self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. Our experiments show that CnC outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets by 5.2% and 6.3%, respectively. Furthermore, for procedure learning using egocentric videos, we propose the EgoProceL dataset consisting of 62 hours of videos captured by 130 subjects performing 16 tasks. The source code and the dataset are available on the project page https://sid2697.github.io/egoprocel/.
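The abstract frames CnC around temporal correspondences between key-steps across videos. As a loose illustration only, and not the authors' implementation, the sketch below matches each frame of one video to its nearest neighbour in another via cosine similarity over hypothetical precomputed frame embeddings; CnC's actual training signal and the subsequent "cut" (key-step segmentation) step are more involved.

```python
import numpy as np

def frame_correspondences(emb_a, emb_b):
    """Match every frame of video A to its nearest frame in video B.

    Illustrative sketch only: assumes precomputed frame embeddings.
    emb_a: (Ta, D) frame embeddings for video A
    emb_b: (Tb, D) frame embeddings for video B
    Returns an array of shape (Ta,) with the best-matching frame index in B.
    """
    # L2-normalise so the dot product equals cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                 # (Ta, Tb) similarity matrix
    return sim.argmax(axis=1)     # nearest neighbour in B for each frame of A

# Toy usage with random 128-d embeddings standing in for two videos.
rng = np.random.default_rng(0)
video_a = rng.normal(size=(20, 128))
video_b = rng.normal(size=(30, 128))
matches = frame_correspondences(video_a, video_b)
print(matches.shape)  # (20,): one matched frame index in B per frame of A
```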
