Paper Title
Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling
Paper Authors
Paper Abstract
Visual storytelling is the task of creating a short story from a photo stream. Unlike conventional visual captioning, storytelling aims to produce not only factual descriptions but also human-like narration and semantics. However, the VIST dataset consists only of a small, fixed number of photos per story. Therefore, the main challenge of visual storytelling is to fill in the visual gaps between photos with a narrative and imaginative story. In this paper, we propose to explicitly learn to imagine a storyline that bridges the visual gap. During training, one or more photos are randomly omitted from the input stack, and the network is trained to produce a full, plausible story even with the missing photo(s). Furthermore, we propose a hide-and-tell model for visual storytelling, designed to learn non-local relations across photo streams and to refine and improve conventional RNN-based models. In experiments, we show that both our hide-and-tell scheme and our network design are effective for storytelling, and that our model outperforms previous state-of-the-art methods on automatic metrics. Finally, we qualitatively demonstrate the learned ability to interpolate a storyline over visual gaps.
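To make the hide-and-tell training scheme concrete, below is a minimal PyTorch sketch of the random-omission idea described in the abstract: one or more photos are "hidden" from the input stream, and the model is still trained to generate a sentence for every story slot. All names here (`hide_photos`, `StorytellerRNN`, the feature sizes, and zero-masking as the hiding mechanism) are illustrative assumptions, not the authors' actual implementation.

```python
# A sketch of hide-and-tell training: randomly hide photo(s), predict the full story.
# Hypothetical design; the paper's model and masking strategy may differ.
import torch
import torch.nn as nn

def hide_photos(features: torch.Tensor, max_hidden: int = 1) -> torch.Tensor:
    """Randomly zero out 1..max_hidden photo features in a stream.

    features: (num_photos, feat_dim) pre-extracted photo features.
    """
    num_photos = features.size(0)
    num_hidden = torch.randint(1, max_hidden + 1, (1,)).item()
    hidden_idx = torch.randperm(num_photos)[:num_hidden]
    masked = features.clone()
    masked[hidden_idx] = 0.0  # hide photo(s); a learned mask token also works
    return masked

class StorytellerRNN(nn.Module):
    """Toy RNN storyteller: one hidden state per photo -> per-sentence logits."""
    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512,
                 vocab_size: int = 10000):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, photo_feats: torch.Tensor) -> torch.Tensor:
        # photo_feats: (num_photos, feat_dim)
        out, _ = self.rnn(photo_feats.unsqueeze(0))  # (1, num_photos, hidden_dim)
        return self.decoder(out.squeeze(0))          # (num_photos, vocab_size)

# Usage: hide photos at the input, but supervise against the *full* story,
# so the network must imagine the storyline across the visual gap.
model = StorytellerRNN()
stream = torch.randn(5, 2048)        # VIST-style stream of 5 photo features
logits = model(hide_photos(stream))  # sentences are still predicted for all 5 slots
```

The key design point the abstract emphasizes is that the supervision target stays complete even when the input is incomplete, which is what forces the model to interpolate a plausible storyline rather than caption only the photos it sees.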