Paper Title

Multi-modal Video Chapter Generation

Authors

Xiao Cao, Zitan Chen, Canyu Le, Lei Meng

Abstract

Chapter generation has become a practical technique for online videos. Chapter breakpoints enable users to quickly find the parts they want and obtain summative annotations. However, there is no public method or dataset for this task. To facilitate research in this direction, we introduce a new dataset called Chapter-Gen, which consists of approximately 10k user-generated videos with annotated chapter information. Our data collection procedure is fast, scalable, and does not require any additional manual annotation. On top of this dataset, we design an effective baseline specifically for the video chapter generation task, which captures two aspects of a video: visual dynamics and narration text. It disentangles local and global video features for localization and title generation, respectively. To parse long videos efficiently, a skip sliding window mechanism is designed to localize potential chapters, and a cross-attention multi-modal fusion module is developed to aggregate local features for title generation. Our experiments demonstrate that the proposed framework achieves superior results over existing methods, which illustrates that method designs for similar tasks cannot be transferred directly, even after fine-tuning. Code and dataset are available at https://github.com/czt117/MVCG.
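
The abstract does not give implementation details, so the following is only a minimal illustrative sketch of what a cross-attention multi-modal fusion step could look like: visual features of one chapter window attend over the narration-text features of the same window, and the result is pooled into a single vector for title generation. All dimensions, module names, and the pooling choice are assumptions for illustration, not the authors' actual architecture.

```python
# Hypothetical sketch of cross-attention multi-modal fusion for one chapter
# window. Dimensions and module names are illustrative assumptions only.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats, text_feats):
        # visual_feats: (batch, num_frames, dim) local visual features
        # text_feats:   (batch, num_tokens, dim) local narration features
        fused, _ = self.cross_attn(query=visual_feats,
                                   key=text_feats,
                                   value=text_feats)
        # Residual connection + normalization, then pool over frames to get
        # one representation per chapter window for title generation.
        fused = self.norm(visual_feats + fused)
        return fused.mean(dim=1)

# Example usage with random features for two chapter windows.
fusion = CrossModalFusion()
v = torch.randn(2, 32, 512)   # 2 windows, 32 frames each
t = torch.randn(2, 64, 512)   # 2 windows, 64 narration tokens each
window_repr = fusion(v, t)    # (2, 512)
```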
