Paper Title
Multimodal Frame-Scoring Transformer for Video Summarization
Paper Authors
Paper Abstract
As the amount of video content has mushroomed in recent years, automatic video summarization has become useful when we want to quickly peek at the content of a video. However, there are two underlying limitations in the generic video summarization task. First, most previous approaches take only visual features as input, leaving the other modalities behind. Second, existing datasets for generic video summarization are relatively insufficient to train a caption generator for extracting text information from a video, and to train multimodal feature extractors. To address these two problems, this paper proposes the Multimodal Frame-Scoring Transformer (MFST), a framework that exploits visual, text, and audio features and scores a video frame by frame. Our MFST framework first extracts the features of each modality (audio, visual, text) using pretrained encoders. MFST then trains the multimodal frame-scoring transformer, which takes multimodal representations built from the extracted features as input and predicts frame-level scores. Our extensive experiments against previous models, together with ablation studies on the TVSum and SumMe datasets, demonstrate the effectiveness and superiority of the proposed method by a large margin in both F1 score and rank-based evaluation.
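To make the described pipeline concrete, here is a minimal PyTorch sketch of the frame-scoring stage. This is an illustration under assumptions, not the authors' implementation: the feature dimensions (1024-d visual, 768-d text, 128-d audio), fusion by simple concatenation, and the sigmoid scoring head are all hypothetical; the abstract only states that pretrained encoders supply per-modality features and that a transformer predicts frame-level scores from the fused multimodal representation.

```python
import torch
import torch.nn as nn

class MultimodalFrameScoringTransformer(nn.Module):
    """Sketch: fuse per-frame visual, text, and audio features and
    regress one importance score per frame with a Transformer encoder.
    All dimensions and the concatenation fusion are assumptions."""

    def __init__(self, d_visual=1024, d_text=768, d_audio=128,
                 d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Project concatenated modality features into a shared space.
        self.fuse = nn.Linear(d_visual + d_text + d_audio, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One importance score per frame, squashed to [0, 1].
        self.scorer = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, visual, text, audio):
        # Each input: (batch, num_frames, d_modality), assumed to be
        # frame-aligned outputs of pretrained per-modality encoders.
        x = torch.cat([visual, text, audio], dim=-1)
        x = self.encoder(self.fuse(x))
        return self.scorer(x).squeeze(-1)  # (batch, num_frames)

# Toy usage: random tensors stand in for real encoder outputs.
model = MultimodalFrameScoringTransformer()
scores = model(torch.randn(2, 120, 1024),
               torch.randn(2, 120, 768),
               torch.randn(2, 120, 128))
print(scores.shape)  # torch.Size([2, 120])
```

Training such a model would typically minimize a regression loss (e.g. MSE) between predicted and human-annotated frame importance scores; the frames with the highest predicted scores then form the summary.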