Paper title
Reading Between the Lines: Exploring Infilling in Visual Narratives
Paper authors
Paper abstract
Generating long-form narratives such as stories and procedures from multiple modalities has been a long-standing dream of artificial intelligence. In such narratives, there is often crucial subtext derived from the surrounding context, and general seq2seq training methods leave models ill-equipped to bridge the gaps between these neighbouring contexts. In this paper, we tackle this problem with infilling techniques, which predict missing steps in a narrative while generating textual descriptions from a sequence of images. We also present a new large-scale visual procedure telling (ViPT) dataset with a total of 46,200 procedures and around 340k paired images and textual descriptions, which is rich in such contextual dependencies. Generating steps with the infilling technique proves effective for visual procedures, yielding more coherent text. We achieve a METEOR score of 27.51 on procedures, higher than the state of the art on visual storytelling. We also demonstrate the effect of interposing new text for missing images during inference. The code and dataset will be publicly available at https://visual-narratives.github.io/Visual-Narratives/.
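To make the infilling setup concrete, here is a minimal sketch of how a training example for step prediction might be constructed: one step of a narrative is held out and replaced with a mask token, and the model is asked to reconstruct it from the surrounding steps. The function name, the `<MASK>` token, and the text-only simplification (the paper conditions on image sequences as well) are illustrative assumptions, not the authors' actual pipeline.

```python
from typing import List, Tuple

# Hypothetical placeholder token for the held-out step.
MASK_TOKEN = "<MASK>"

def make_infilling_example(steps: List[str], missing_idx: int) -> Tuple[List[str], str]:
    """Build one infilling training pair from a narrative.

    Replaces the step at `missing_idx` with MASK_TOKEN and returns
    (masked sequence, held-out step). A seq2seq model trained on such
    pairs must use the neighbouring context to predict the missing step.
    """
    if not 0 <= missing_idx < len(steps):
        raise IndexError("missing_idx out of range")
    masked = list(steps)          # copy so the original narrative is untouched
    target = masked[missing_idx]  # the step the model must reconstruct
    masked[missing_idx] = MASK_TOKEN
    return masked, target

# Example with a three-step procedure:
steps = ["boil the water", "add the pasta", "drain and serve"]
masked, target = make_infilling_example(steps, 1)
# masked -> ["boil the water", "<MASK>", "drain and serve"]
# target -> "add the pasta"
```

At inference time the same masking can be applied when an image for a step is missing, so the model interposes new text for that position from context alone.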