Paper Title
Movies2Scenes: Using Movie Metadata to Learn Scene Representation
Paper Authors
Paper Abstract
Understanding scenes in movies is crucial for a variety of applications such as video moderation, search, and recommendation. However, labeling individual scenes is a time-consuming process. In contrast, movie-level metadata (e.g., genre, synopsis, etc.) is regularly produced as part of the film production process, and is therefore significantly more commonly available. In this work, we propose a novel contrastive learning approach that uses movie metadata to learn a general-purpose scene representation. Specifically, we use movie metadata to define a measure of movie similarity, and use it during contrastive learning to limit our search for positive scene pairs to only the movies that are considered similar to each other. Our learned scene representation consistently outperforms existing state-of-the-art methods on a diverse set of tasks evaluated using multiple benchmark datasets. Notably, our learned representation offers an average improvement of 7.9% on the seven classification tasks and a 9.7% improvement on the two regression tasks in the LVU dataset. Furthermore, using a newly collected movie dataset, we present comparative results of our scene representation on a set of video moderation tasks to demonstrate its generalizability on previously less explored tasks.
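To make the pair-selection idea concrete, the following is a minimal Python sketch (not the authors' released code) of how positive scene pairs could be restricted to scenes drawn from movies judged similar by a metadata-based measure. The genre-Jaccard similarity, the 0.5 threshold, and all function and variable names here are illustrative assumptions, not details taken from the paper.

```python
import random

def movie_similarity(meta_a: dict, meta_b: dict) -> float:
    """Jaccard overlap of genre sets -- a simple stand-in for a
    metadata-based movie-similarity measure."""
    genres_a, genres_b = set(meta_a["genres"]), set(meta_b["genres"])
    if not genres_a or not genres_b:
        return 0.0
    return len(genres_a & genres_b) / len(genres_a | genres_b)

def sample_positive_pair(movie_scenes: dict, metadata: dict,
                         anchor_movie: str, threshold: float = 0.5) -> tuple:
    """Pick an anchor scene from `anchor_movie` and a positive scene drawn
    only from movies whose metadata similarity exceeds `threshold`."""
    similar = [m for m in movie_scenes
               if m != anchor_movie
               and movie_similarity(metadata[anchor_movie], metadata[m]) >= threshold]
    if not similar:
        # Fallback: no sufficiently similar movie, use another scene of the same movie.
        similar = [anchor_movie]
    anchor_scene = random.choice(movie_scenes[anchor_movie])
    positive_scene = random.choice(movie_scenes[random.choice(similar)])
    return anchor_scene, positive_scene

if __name__ == "__main__":
    # Toy data: movie id -> scene ids, movie id -> movie-level metadata.
    movie_scenes = {"m1": ["m1_s1", "m1_s2"], "m2": ["m2_s1"], "m3": ["m3_s1"]}
    metadata = {"m1": {"genres": ["Drama", "Thriller"]},
                "m2": {"genres": ["Drama", "Thriller"]},
                "m3": {"genres": ["Comedy"]}}
    print(sample_positive_pair(movie_scenes, metadata, "m1"))
```

In this toy example, scenes from "m1" pair with scenes from "m2" (matching genres) but never with "m3", mirroring the abstract's idea of constraining positives to metadata-similar movies rather than to the same scene or the same movie only.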