Paper Title

Integrating Image Captioning with Rule-based Entity Masking

Authors

Aditya Mogadala, Xiaoyu Shen, Dietrich Klakow

Abstract

Given an image, generating its natural language description (i.e., a caption) is a well-studied problem. Approaches proposed to address this problem usually rely on image features that are difficult to interpret. In particular, these image features are subdivided into global and local features, where global features are extracted from a global representation of the image, while local features are extracted from objects detected locally in the image. Although local features capture rich visual information, existing models generate captions in a black-box manner, and humans have difficulty interpreting which local objects a caption is meant to represent. Hence, in this paper we propose a novel framework for image captioning with an explicit object (e.g., knowledge graph entity) selection process that still maintains end-to-end trainability. The model first explicitly selects which local entities to include in the caption according to a human-interpretable mask, and then generates a caption by attending to the selected entities. Experiments conducted on the MSCOCO dataset demonstrate that our method achieves good performance in terms of caption quality and diversity, with a more interpretable generation process than previous counterparts.
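To make the select-then-attend idea concrete, below is a minimal, hypothetical PyTorch sketch: a mask over detected entity features decides which entities are visible (here a learned soft mask, though a rule-based system could supply a hard 0/1 mask instead), and the decoder then attends only to the selected entities at each generation step. All module and variable names (`EntitySelectCaptioner`, `entity_feats`, etc.) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of "select-then-attend" captioning: a mask first picks
# which detected entity features may be used, then the decoder attends only
# over the selected entities. Not the authors' code; names are illustrative.
import torch
import torch.nn as nn

class EntitySelectCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.selector = nn.Linear(feat_dim, 1)   # scores each detected entity
        self.proj = nn.Linear(feat_dim, hid_dim)
        self.attn = nn.MultiheadAttention(hid_dim, num_heads=1, batch_first=True)
        self.decoder = nn.LSTMCell(hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, entity_feats, max_len=20):
        # entity_feats: (batch, num_entities, feat_dim), e.g. detector outputs.
        # 1) Explicit selection: a soft (hence differentiable) mask over
        #    entities; the mask itself is the human-interpretable artifact.
        mask_logits = self.selector(entity_feats).squeeze(-1)  # (B, N)
        mask = torch.sigmoid(mask_logits)                      # (B, N) in [0, 1]
        dropped = mask < 0.5              # True = entity excluded from attention

        ents = self.proj(entity_feats)                         # (B, N, H)
        h = ents.mean(dim=1)                                   # init decoder state
        c = torch.zeros_like(h)
        logits = []
        for _ in range(max_len):
            # 2) Generate each word by attending only to selected entities.
            ctx, _ = self.attn(h.unsqueeze(1), ents, ents,
                               key_padding_mask=dropped)       # (B, 1, H)
            h, c = self.decoder(ctx.squeeze(1), (h, c))
            logits.append(self.out(h))
        # Return word logits plus the mask, so the selection can be inspected.
        return torch.stack(logits, dim=1), mask
```

Because the mask is produced inside the network and applied as an attention constraint rather than a hard preprocessing step, the whole pipeline remains trainable end to end, which is the property the abstract emphasizes.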
