Paper Title

TextCaps: a Dataset for Image Captioning with Reading Comprehension

Paper Authors

Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh

Paper Abstract

Image descriptions can help visually impaired people to quickly understand the image content. While we made significant progress in automatically describing images and optical character recognition, current approaches are unable to include written text in their descriptions, although text is omnipresent in human environments and frequently critical to understand our surroundings. To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images. Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects. We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension. Our analysis with automatic and human studies shows that our new TextCaps dataset provides many new technical challenges over previous datasets.
