Paper Title


Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia

Paper Authors

Khanh Nguyen, Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas

Paper Abstract


Humans exploit prior knowledge to describe images, and are able to adapt their explanation to specific contextual information, even to the extent of inventing plausible explanations when contextual information and images do not match. In this work, we propose the novel task of captioning Wikipedia images by integrating contextual knowledge. Specifically, we produce models that jointly reason over Wikipedia articles, Wikimedia images and their associated descriptions to produce contextualized captions. In particular, similar Wikimedia images can be used to illustrate different articles, and the produced caption needs to be adapted to a specific context, therefore allowing us to explore the limits of a model to adjust captions to different contextual information. A particularly challenging task in this domain is dealing with out-of-dictionary words and Named Entities. To address this, we propose a pre-training objective, Masked Named Entity Modeling (MNEM), and show that this pretext task yields an improvement compared to baseline models. Furthermore, we verify that a model pre-trained with the MNEM objective on Wikipedia generalizes well to a News Captioning dataset. Additionally, we define two different test splits according to the difficulty of the captioning task. We offer insights on the role and the importance of each modality and highlight the limitations of our model. The code, models and data splits will be made publicly available upon acceptance.
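The abstract does not spell out how the MNEM pretext task is implemented. As a rough intuition, it is analogous to masked language modeling, but the masked positions are restricted to named-entity spans so the model must recover entities from the surrounding article context and the image. The sketch below is a minimal, illustrative masking step under that assumption; the entity spans, masking probability and function names are hypothetical and not the authors' released code.

```python
# Illustrative sketch of an MNEM-style masking step (assumption: entity spans
# come from an off-the-shelf NER tagger; this is not the paper's implementation).
import random

MASK_TOKEN = "[MASK]"

def mask_named_entities(tokens, entity_spans, mask_prob=0.8):
    """Replace tokens inside named-entity spans with [MASK].

    tokens       : list of word tokens
    entity_spans : list of (start, end) index pairs covering named entities
    mask_prob    : probability of masking each entity token (illustrative value)
    Returns (masked_tokens, labels), where labels keep the original token at
    masked positions and None elsewhere, as in standard MLM-style objectives.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            if random.random() < mask_prob:
                labels[i] = tokens[i]
                masked[i] = MASK_TOKEN
    return masked, labels

# Example with made-up entity spans ("Lionel Messi", "PSG", "2021"):
tokens = ["Lionel", "Messi", "joined", "PSG", "in", "2021", "."]
entity_spans = [(0, 2), (3, 4), (5, 6)]
print(mask_named_entities(tokens, entity_spans))
```

The model would then be trained to predict the original entity tokens at the masked positions, which is what makes the pretext task target out-of-dictionary words and Named Entities rather than arbitrary tokens.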
