Paper Title
ParaCNN: Visual Paragraph Generation via Adversarial Twin Contextual CNNs
Paper Authors
Paper Abstract
Image description generation plays an important role in many real-world applications, such as image retrieval, automatic navigation, and support for people with disabilities. A well-studied task in image description generation is image captioning, which usually produces a short caption sentence and thus neglects many fine-grained properties, e.g., information about subtle objects and their relationships. In this paper, we study visual paragraph generation, which describes an image with a long paragraph containing rich details. Previous research often generates the paragraph with a hierarchical Recurrent Neural Network (RNN)-like model, which involves complex memorising, forgetting, and coupling mechanisms. Instead, we propose a novel pure CNN model, ParaCNN, which generates a visual paragraph with a hierarchical CNN architecture that exploits contextual information between the sentences within a paragraph. ParaCNN can generate paragraphs of arbitrary length, which makes it more applicable to many real-world applications. Furthermore, to enable ParaCNN to model paragraphs comprehensively, we also propose an adversarial twin net training scheme. During training, we force the forwarding network's hidden features to be close to those of the backward network by using adversarial training. During testing, we use only the forwarding network, which already incorporates the knowledge of the backward network, to generate a paragraph. We conduct extensive experiments on the Stanford Visual Paragraph dataset and achieve state-of-the-art performance.
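To make the twin-net training scheme concrete, here is a minimal, hypothetical PyTorch sketch of the adversarial feature-matching objective described in the abstract: a discriminator learns to separate the forwarding network's hidden features from the backward network's, while the forwarding network is trained to fool it. All names here (`FeatureDiscriminator`, `discriminator_loss`, `adversarial_feature_loss`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FeatureDiscriminator(nn.Module):
    """Hypothetical discriminator: classifies a hidden-feature vector as
    coming from the backward network (label 1) or the forwarding network
    (label 0)."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # raw logit
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(D, fwd_feats, bwd_feats):
    # Train D to say "backward" (1) for the backward network's features
    # and "forward" (0) for the forwarding network's features.
    # detach() keeps this update from flowing into either twin network.
    bwd_logits = D(bwd_feats.detach())
    fwd_logits = D(fwd_feats.detach())
    return (bce(bwd_logits, torch.ones_like(bwd_logits))
            + bce(fwd_logits, torch.zeros_like(fwd_logits)))

def adversarial_feature_loss(D, fwd_feats):
    # Train the forwarding network to make its hidden features
    # indistinguishable from the backward network's, i.e. to fool D.
    logits = D(fwd_feats)
    return bce(logits, torch.ones_like(logits))
```

Consistent with the abstract, at test time only the forwarding network would be run; the backward network and the discriminator serve purely as training-time scaffolding and are discarded afterwards.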