Paper Title

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Authors

Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit Bansal

Abstract


Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history to better predict the next sentence (w.r.t. coreference and repetition aspects), thus encouraging coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets, ActivityNet Captions and YouCookII, show that MART generates more coherent and less repetitive paragraph captions than baseline methods, while maintaining relevance to the input video events. All code is available at: https://github.com/jayleicn/recurrent-transformer
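The abstract's core idea is a recurrent memory state that summarizes the video segments and sentence history seen so far, which then conditions generation of the next sentence. The following is a minimal illustrative sketch of such a gated, GRU-style memory update, not MART's actual implementation (the real model uses learned multi-head attention and trained parameters; here the matrices `W_z`, `W_c`, the hidden size `D`, and the random "segment states" are all hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden size (illustrative only)

# Hypothetical projection matrices; in the real model these are learned.
W_z = rng.normal(size=(2 * D, D))
W_c = rng.normal(size=(2 * D, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_memory(memory, segment_state):
    """Gated update: blend the previous memory with a candidate state
    summarizing the current video segment plus sentence history."""
    x = np.concatenate([memory, segment_state])
    z = sigmoid(x @ W_z)                  # update gate in (0, 1)
    c = np.tanh(x @ W_c)                  # candidate memory in (-1, 1)
    return (1.0 - z) * memory + z * c     # new memory state

memory = np.zeros(D)
for step in range(3):                     # one update per video segment
    segment_state = rng.normal(size=D)    # stand-in for an encoder output
    memory = update_memory(memory, segment_state)
    # `memory` would condition the decoder when generating sentence `step`,
    # carrying forward entities/actions to discourage repetition.

print(memory.shape)
```

Because each new memory is a convex combination of the old memory and a `tanh`-bounded candidate, the state stays bounded across segments, which is one reason gated updates are a common choice for carrying history over many steps.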
