Paper Title
SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space
Paper Authors
Paper Abstract
In this work, we formulate a visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round. Based on this formulation, we consider the visual dialog task as a sequence problem consisting of ordered visual-linguistic vectors. For featurization, we use a Dense Symmetric Co-Attention network as a lightweight vision-language joint representation generator to fuse multimodal features (i.e., image and text), yielding better computational and data efficiency. For inference, we propose two Sequential Dialog Networks (SeqDialN): the first uses an LSTM for information propagation (IP) and the second uses a modified Transformer for multi-step reasoning (MR). Our architecture separates the complexity of multimodal feature fusion from that of inference, which allows a simpler design of the inference engine. The IP-based SeqDialN is our baseline; its simple 2-layer LSTM design achieves decent performance. The MR-based SeqDialN, on the other hand, recurrently refines the semantic question/history representations through the Transformer's self-attention stack and produces promising results on the visual dialog task. On the VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves 62.54% NDCG and 48.63% MRR; our ensemble of generative SeqDialNs achieves 63.78% NDCG and 49.98% MRR, setting a new state of the art for generative visual dialog models. We fine-tune the discriminative SeqDialN with dense annotations and boost performance to 72.41% NDCG and 55.11% MRR. We discuss the extensive experiments we conducted to demonstrate the effectiveness of our model components, provide visualizations of the reasoning process over relevant conversation rounds, and discuss our fine-tuning methods. Our code is available at https://github.com/xiaoxiaoheimei/SeqDialN.
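To make the sequence formulation concrete, the following PyTorch sketch illustrates an IP-style baseline in the spirit of the abstract: each dialog round is assumed to already be fused into one joint vision-language vector (e.g., by a co-attention encoder), and a 2-layer LSTM propagates information across rounds, whose hidden states then score candidate answers. This is a minimal illustration, not the authors' implementation; the class name, dimensions, and scoring head are all hypothetical.

```python
import torch
import torch.nn as nn

class IPSeqDialNSketch(nn.Module):
    """Hypothetical sketch of an IP-based SeqDialN baseline:
    a 2-layer LSTM over per-round joint vision-language vectors."""

    def __init__(self, joint_dim=512, hidden_dim=512):
        super().__init__()
        # 2-layer LSTM propagates information across dialog rounds.
        self.lstm = nn.LSTM(joint_dim, hidden_dim,
                            num_layers=2, batch_first=True)
        # Projects each round's hidden state into the answer space.
        self.score = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, round_feats, answer_feats):
        # round_feats:  (batch, rounds, joint_dim) -- one fused
        #               vision-language vector per dialog round.
        # answer_feats: (batch, rounds, candidates, hidden_dim)
        hidden, _ = self.lstm(round_feats)       # (B, R, H)
        query = self.score(hidden).unsqueeze(2)  # (B, R, 1, H)
        # Dot-product score for each candidate answer per round.
        return (query * answer_feats).sum(-1)    # (B, R, candidates)

# Toy usage with random features: 10 dialog rounds,
# 100 candidate answers per round (as in VisDial).
model = IPSeqDialNSketch()
rounds = torch.randn(2, 10, 512)
answers = torch.randn(2, 10, 100, 512)
print(model(rounds, answers).shape)  # torch.Size([2, 10, 100])
```

The key design point the abstract emphasizes is visible even in this toy version: fusion happens before the sequence model, so the inference engine only sees one vector per round and stays simple; swapping the LSTM for a self-attention stack would give the MR-style variant.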