Paper Title
Show, Edit and Tell: A Framework for Editing Image Captions
Paper Authors
Paper Abstract
Most image captioning frameworks generate captions directly from images, learning a mapping from visual features to natural language. However, editing existing captions can be easier than generating new ones from scratch. Intuitively, when editing captions, a model is not required to learn information that is already present in the caption (i.e. sentence structure), enabling it to focus on fixing details (e.g. replacing repetitive words). This paper proposes a novel approach to image captioning based on iterative adaptive refinement of an existing caption. Specifically, our caption-editing model consists of two sub-modules: (1) EditNet, a language module with an adaptive copy mechanism (Copy-LSTM) and a Selective Copy Memory Attention mechanism (SCMA), and (2) DCNet, an LSTM-based denoising auto-encoder. These components enable our model to directly copy from and modify existing captions. Experiments demonstrate that our new approach achieves state-of-the-art performance on the MS COCO dataset both with and without sequence-level training.
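To make the "copy from an existing caption" idea concrete, below is a minimal, illustrative sketch of a generic adaptive copy step: a gate blends a generation distribution over the vocabulary with a copy distribution over the tokens of the existing caption. This is the general pointer/copy pattern, not the paper's actual Copy-LSTM or SCMA; all function names, the externally supplied `gate`, and the toy inputs are assumptions for illustration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def adaptive_copy_step(gen_logits, copy_scores, source_tokens, vocab, gate):
    """One decoding step of a generic copy mechanism (illustrative).

    gen_logits    : unnormalized scores over `vocab` (the "generate" path)
    copy_scores   : unnormalized attention scores over `source_tokens`,
                    i.e. the words of the existing caption (the "copy" path)
    gate          : copy probability in [0, 1]; in an adaptive mechanism this
                    would be predicted by the decoder, here it is an input
    Returns a probability distribution over `vocab`.
    """
    gen_dist = softmax(gen_logits)    # distribution over the vocabulary
    copy_att = softmax(copy_scores)   # attention over source caption tokens

    # Scatter the copy attention onto the vocabulary (tokens repeated in
    # the source caption accumulate their attention mass).
    tok_index = {t: i for i, t in enumerate(vocab)}
    copy_dist = [0.0] * len(vocab)
    for att, tok in zip(copy_att, source_tokens):
        copy_dist[tok_index[tok]] += att

    # Final mixture: gate * copy + (1 - gate) * generate.
    return [gate * c + (1.0 - gate) * g for c, g in zip(copy_dist, gen_dist)]

# Toy usage: with gate = 1 the model purely copies the source word "cat".
vocab = ["a", "cat", "dog"]
dist = adaptive_copy_step([0.0, 0.0, 0.0], [0.0], ["cat"], vocab, gate=1.0)
```

With `gate = 0.0` the same call falls back to the pure generation distribution, which is how such a gate lets the model reuse caption structure while still rewriting individual words.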