Paper Title

Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

Paper Authors

Fang, Zhiyuan; Gokhale, Tejas; Banerjee, Pratyay; Baral, Chitta; Yang, Yezhou

Abstract

Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes such as movements, manipulations, and transformations of the objects in the scene, are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset "Video-to-Commonsense (V2C)" that contains $\sim9k$ videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.
