Paper Title

VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles

Authors

Mingzhe Li, Xiuying Chen, Shen Gao, Zhangming Chan, Dongyan Zhao, Rui Yan

Abstract

A popular multimedia news format nowadays provides users with a lively video and a corresponding news article, a combination employed by influential news media such as CNN and BBC and by social media platforms including Twitter and Weibo. In this setting, automatically choosing a proper cover frame for the video and generating an appropriate textual summary of the article can help editors save time and help readers make decisions more effectively. Hence, in this paper, we propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO) to tackle this problem. The main challenge in this task is to jointly model the temporal dependency of the video with the semantic meaning of the article. To this end, we propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and a multimodal generator. In the dual interaction module, we propose a conditional self-attention mechanism that captures local semantic information within the video and a global-attention mechanism that handles the semantic relationship between news text and video at a high level. Extensive experiments conducted on a large-scale real-world VMSMO dataset show that DIMS achieves state-of-the-art performance in terms of both automatic metrics and human evaluations.
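The abstract names two attention mechanisms but gives no equations. For intuition only, below is a minimal PyTorch sketch of how such a dual interaction module could be wired, assuming segment-level video features and token-level article features of the same dimension. The class name `DualInteractionSketch`, the additive article-conditioning, and all hyperparameters are assumptions made here for illustration; this is not the authors' DIMS implementation.

```python
import torch
import torch.nn as nn

class DualInteractionSketch(nn.Module):
    """Hypothetical illustration of the dual interaction idea from the
    abstract; NOT the paper's released code. Layer choices are assumptions."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Conditional self-attention (assumed form): video segments attend
        # to each other after being conditioned on a pooled article vector.
        self.condition = nn.Linear(d_model, d_model)
        self.video_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Global attention (assumed form): article tokens attend over the
        # contextualized video segments to model the text-video relationship.
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, video_segments: torch.Tensor, article_tokens: torch.Tensor):
        # video_segments: (batch, n_segments, d_model) segment-level features
        # article_tokens: (batch, n_tokens, d_model) token-level features
        article_gist = self.condition(article_tokens.mean(dim=1, keepdim=True))
        conditioned = video_segments + article_gist  # condition video on article
        video_ctx, _ = self.video_self_attn(conditioned, conditioned, conditioned)
        fused_text, _ = self.global_attn(article_tokens, video_ctx, video_ctx)
        return video_ctx, fused_text  # would feed a multimodal generator

# Toy usage: 8 video segments and 40 article tokens per example.
module = DualInteractionSketch()
video = torch.randn(2, 8, 256)
text = torch.randn(2, 40, 256)
video_ctx, fused_text = module(video, text)
print(video_ctx.shape, fused_text.shape)  # (2, 8, 256) and (2, 40, 256)
```

In the actual model, the contextualized video representation would drive cover-frame selection and the fused text representation would drive summary generation; how those heads are built is not specified in this section.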
