Paper Title

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

Authors

Vladimir Iashin, Esa Rahtu

Abstract

Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting visual features alone, completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor results or demonstrate the importance of the audio modality only on a domain-specific dataset. In this paper, we introduce the Bi-modal Transformer, which generalizes the Transformer architecture to bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task. We also show that the pre-trained bi-modal encoder, as part of the Bi-modal Transformer, can be used as a feature extractor for a simple proposal generation module. The performance is demonstrated on the challenging ActivityNet Captions dataset, where our model achieves outstanding results. The code is available at v-iashin.github.io/bmt
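
To make the core idea concrete, below is a minimal PyTorch sketch of a bi-modal encoder layer in which two modality streams (e.g., audio and visual feature sequences) attend to themselves and to each other. All class names, dimensions, and layer arrangements here are illustrative assumptions, not the authors' implementation; the actual architecture is available at v-iashin.github.io/bmt.

```python
import torch
import torch.nn as nn

class BiModalEncoderLayer(nn.Module):
    """Hypothetical sketch of a bi-modal encoder layer: each modality
    runs self-attention, then cross-attends to the other modality.
    Residual connections and feed-forward sublayers are simplified."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Self-attention within each modality stream.
        self.self_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention: each modality queries the other's features.
        self.cross_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Within-modality self-attention (some residuals omitted for brevity).
        a, _ = self.self_attn_a(audio, audio, audio)
        v, _ = self.self_attn_v(visual, visual, visual)
        # Audio queries visual keys/values, and vice versa.
        a2, _ = self.cross_attn_a(a, v, v)
        v2, _ = self.cross_attn_v(v, a, a)
        # Residual add and normalization on the cross-attended features.
        return self.norm_a(a + a2), self.norm_v(v + v2)

# Usage with dummy audio (B, Ta, d) and visual (B, Tv, d) sequences.
layer = BiModalEncoderLayer()
audio = torch.randn(2, 30, 128)
visual = torch.randn(2, 50, 128)
a_out, v_out = layer(audio, visual)
print(a_out.shape, v_out.shape)  # (2, 30, 128) (2, 50, 128)
```

The outputs of such a layer could then feed a captioning decoder, or, as the abstract suggests for the pre-trained bi-modal encoder, serve as fused features for a separate proposal generation module.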
