Paper Title
Learning Distinct and Representative Styles for Image Captioning
Paper Authors
Paper Abstract
Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward the "average" caption that only captures the most general mode (a.k.a., language pattern) in the training corpus, i.e., the so-called mode collapse problem. As a result, the generated captions are limited in diversity and usually less informative than natural image descriptions made by humans. In this paper, we seek to avoid this problem by proposing a Discrete Mode Learning (DML) paradigm for image captioning. Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings", and further use them to control the mode of the captions generated by existing image captioning models. Specifically, the proposed DML optimizes a dual architecture that consists of an image-conditioned discrete variational autoencoder (CdVAE) branch and a mode-conditioned image captioning (MIC) branch. The CdVAE branch maps each image caption to one of the mode embeddings stored in a learned codebook, and is trained with a pure non-autoregressive generation objective to make the modes distinct and representative. The MIC branch can be simply modified from an existing image captioning model, where the mode embedding is added to the original word embeddings as the control signal. In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet. The results show that the learned mode embeddings successfully enable these models to generate high-quality image captions with different modes, leading to better performance in terms of both diversity and quality on the MSCOCO dataset.
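To make the two branches described above more concrete, the sketch below illustrates the core mechanism in PyTorch: a codebook that quantizes a caption feature to its nearest "mode embedding" (VQ-style lookup with a straight-through estimator), and an MIC-style embedding layer that adds the selected mode embedding to the word embeddings as a control signal. All class names, dimensions, and the number of modes here are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the mode codebook and mode-conditioned embedding described
# in the abstract. Shapes, names, and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class ModeCodebook(nn.Module):
    """Codebook of K discrete "mode embeddings" with nearest-neighbor lookup."""

    def __init__(self, num_modes: int = 64, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(num_modes, dim)

    def forward(self, caption_feat: torch.Tensor):
        # caption_feat: (batch, dim) encoding of a ground-truth caption.
        # Assign each caption to its closest codebook entry (its "mode").
        dists = torch.cdist(caption_feat, self.codebook.weight)  # (batch, K)
        mode_idx = dists.argmin(dim=-1)                          # (batch,)
        mode_emb = self.codebook(mode_idx)                       # (batch, dim)
        # Straight-through estimator so gradients reach the caption encoder.
        mode_emb = caption_feat + (mode_emb - caption_feat).detach()
        return mode_emb, mode_idx


class ModeConditionedEmbedding(nn.Module):
    """Word embedding layer with the mode embedding added as a control signal."""

    def __init__(self, vocab_size: int = 10000, dim: int = 512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)

    def forward(self, tokens: torch.Tensor, mode_emb: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len); mode_emb: (batch, dim), broadcast over positions.
        return self.word_emb(tokens) + mode_emb.unsqueeze(1)


# Usage: quantize a caption feature to a mode, then condition the captioner's
# token inputs on that mode before feeding them to a Transformer/AoANet decoder.
codebook = ModeCodebook()
embedder = ModeConditionedEmbedding()
caption_feat = torch.randn(2, 512)              # from some caption encoder (not shown)
mode_emb, mode_idx = codebook(caption_feat)
tokens = torch.randint(0, 10000, (2, 16))
conditioned_inputs = embedder(tokens, mode_emb)  # (2, 16, 512)
```

At inference time, one could simply pick different codebook indices to steer the captioner toward different modes, which is the control behavior the abstract describes; how the indices are chosen in practice is not specified here.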