Paper Title
Multi-Image Summarization: Textual Summary from a Set of Cohesive Images
Paper Authors
Paper Abstract
Multi-sentence summarization is a well-studied problem in NLP, while generating image descriptions for a single image is a well-studied problem in Computer Vision. However, for applications such as image cluster labeling or web page summarization, summarizing a set of images is also a useful and challenging task. This paper proposes the new task of multi-image summarization, which aims to generate a concise and descriptive textual summary given a coherent set of input images. We propose a model that extends the Transformer-based image-captioning architecture from single images to multiple images. A dense average image feature aggregation network allows the model to focus on a coherent subset of attributes across the input images. We explore various input representations to the Transformer network and empirically show that aggregated image features are superior to individual image embeddings. We additionally show that the performance of the model is further improved by pretraining the model parameters on a single-image captioning task, which appears to be particularly effective in eliminating hallucinations in the output.
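To make the described architecture concrete, below is a minimal sketch of how dense average feature aggregation could feed a Transformer decoder that generates the summary. This is not the authors' implementation: the class names (`DenseAverageAggregator`, `MultiImageSummarizer`), the layer sizes, the vocabulary size, and the use of PyTorch are all assumptions chosen for illustration; only the overall idea of projecting per-image features, averaging them, and decoding text from the aggregate comes from the abstract.

```python
import torch
import torch.nn as nn


class DenseAverageAggregator(nn.Module):
    """Hypothetical sketch: project each image's features through dense layers,
    then average across images into one aggregated representation."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_images, feat_dim)
        projected = self.proj(image_feats)           # (batch, num_images, hidden_dim)
        return projected.mean(dim=1, keepdim=True)   # (batch, 1, hidden_dim)


class MultiImageSummarizer(nn.Module):
    """Aggregated image features act as the encoder memory for a standard
    Transformer decoder that generates the textual summary (illustrative only)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=30522,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.aggregator = DenseAverageAggregator(feat_dim, hidden_dim)
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, summary_tokens):
        memory = self.aggregator(image_feats)        # (batch, 1, hidden_dim)
        tgt = self.token_emb(summary_tokens)         # (batch, seq_len, hidden_dim)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(
            summary_tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)                     # (batch, seq_len, vocab_size)


# Example: batch of 2 image sets, each with 3 images of 2048-d features,
# and a 10-token summary prefix for teacher-forced decoding.
model = MultiImageSummarizer()
feats = torch.randn(2, 3, 2048)
tokens = torch.randint(0, 30522, (2, 10))
logits = model(feats, tokens)
print(logits.shape)  # torch.Size([2, 10, 30522])
```

In this sketch, averaging collapses the set of images into a single memory vector, which mirrors the abstract's claim that aggregated image features can outperform feeding individual image embeddings; swapping the mean for per-image memory tokens would correspond to the alternative input representations the paper compares.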