Paper Title
Is an Object-Centric Video Representation Beneficial for Transfer?

Authors

Chuhan Zhang, Ankush Gupta, Andrew Zisserman

Abstract


The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip. We also introduce a novel trajectory contrast loss to further enhance objectness in these summary vectors. With experiments on four datasets -- SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens -- we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware), when: (1) classifying actions on unseen objects and unseen environments; (2) low-shot learning of novel classes; (3) linear probe to other downstream tasks; as well as (4) for standard action classification.
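The abstract does not specify the form of the trajectory contrast loss. As a rough illustration only (an assumption, not the paper's formulation), a generic InfoNCE-style contrastive objective that encourages each summary vector to match its paired trajectory embedding, while repelling the embeddings of other clips in the batch, could be sketched as:

```python
import numpy as np

def info_nce(queries, keys, temperature=0.07):
    """Generic InfoNCE contrastive loss (illustrative, not the paper's exact loss).

    queries: (N, D) array, e.g. hypothetical object-centric summary vectors.
    keys:    (N, D) array, e.g. hypothetical trajectory embeddings;
             queries[i] and keys[i] form the positive pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                     # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; minimize their negative log-likelihood.
    return -np.mean(np.diag(log_probs))
```

Matched pairs yield a near-zero loss, while mismatched pairs are penalized; the temperature controls how sharply the softmax concentrates on the hardest negatives.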
