Paper Title


Sparse Graph to Sequence Learning for Vision Conditioned Long Textual Sequence Generation

Authors

Aditya Mogadala, Marius Mosbach, Dietrich Klakow

Abstract


Generating longer textual sequences conditioned on visual information is an interesting problem to explore. The challenge goes beyond standard vision-conditioned sentence-level generation (e.g., image or video captioning), as it requires producing a brief yet coherent story describing the visual content. In this paper, we cast this Vision-to-Sequence task as a Graph-to-Sequence learning problem and approach it with the Transformer architecture. Specifically, we introduce the Sparse Graph-to-Sequence Transformer (SGST) for encoding the graph and decoding a sequence. The encoder aims to directly encode graph-level semantics, while the decoder is used to generate longer sequences. Experiments conducted on the benchmark image paragraph dataset show that our proposed method achieves a 13.3% improvement on the CIDEr evaluation measure compared to the previous state-of-the-art approach.
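To illustrate the core idea of a sparse graph encoder, here is a minimal sketch (assuming NumPy) of self-attention over graph nodes where attention is masked by the adjacency matrix, so each node attends only to itself and its neighbors. The function name and shapes are illustrative assumptions, not the authors' SGST implementation:

```python
import numpy as np

def sparse_graph_attention(X, A):
    """Single-head self-attention over graph nodes in which each node
    attends only to itself and its graph neighbors.

    X: (n, d) node feature matrix
    A: (n, n) 0/1 adjacency matrix
    Returns: (n, d) aggregated node representations.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)              # pairwise attention logits
    mask = (A + np.eye(len(A))) > 0            # allow self plus neighbors
    scores = np.where(mask, scores, -1e9)      # block non-edges (sparsity)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ X                         # neighborhood-weighted mix
```

With an empty adjacency matrix every node attends only to itself, so the output equals the input; adding edges mixes in neighbor features. A full model would stack such layers with learned projections and feed the encoded nodes to a sequence decoder.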
