Paper Title
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Paper Authors
Paper Abstract
Visual dialog is a challenging vision-language task in which a dialog agent must answer a series of questions by reasoning over the image content and the dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. In contrast, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language model for visual dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without pretraining on external vision-language data, our model yields a new state of the art, achieving the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores) on the Visual Dialog leaderboard. Our code and pretrained models are released at https://github.com/salesforce/VD-BERT.
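
To make the "unified" design concrete, below is a minimal PyTorch sketch of the single-stream idea described in the abstract: projected image region features and dialog token embeddings are concatenated into one sequence for a shared Transformer encoder, whose outputs feed both a ranking head (discriminative) and a token-prediction head (generative). This is an illustrative sketch under stated assumptions, not the released implementation: the class name, the 2048-d detector features, and the generic nn.TransformerEncoder standing in for pretrained BERT are all assumptions of this example.

# Minimal sketch (not the authors' code) of VD-BERT's single-stream design:
# image region features and dialog tokens share one Transformer encoder.
# All names and dimensions here are illustrative; the real model is
# initialized from pretrained BERT and uses its tokenizer and heads.
import torch
import torch.nn as nn

class UnifiedVisionDialogEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, img_feat_dim=2048,
                 layers=12, heads=12):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # Project detector region features (assumed 2048-d, e.g. from a
        # Faster R-CNN) into the same space as the word embeddings.
        self.img_proj = nn.Linear(img_feat_dim, hidden)
        # Segment embeddings distinguish vision (0) from language (1) inputs.
        self.seg_emb = nn.Embedding(2, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Two heads on one shared backbone: a ranking score from the first
        # position (standing in for [CLS]) and per-token logits for
        # masked-token generation.
        self.rank_head = nn.Linear(hidden, 1)
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, img_feats, dialog_ids):
        # img_feats: (B, num_regions, img_feat_dim); dialog_ids: (B, seq_len)
        v = self.img_proj(img_feats)
        t = self.token_emb(dialog_ids)
        seg = torch.cat([
            torch.zeros(v.shape[:2], dtype=torch.long, device=v.device),
            torch.ones(t.shape[:2], dtype=torch.long, device=t.device),
        ], dim=1)
        # One sequence, one encoder: every layer attends across both
        # modalities ("single-stream"), rather than fusing two towers.
        x = torch.cat([v, t], dim=1) + self.seg_emb(seg)
        h = self.encoder(x)
        rank_score = self.rank_head(h[:, 0])   # answer ranking
        token_logits = self.lm_head(h)         # answer generation
        return rank_score, token_logits

model = UnifiedVisionDialogEncoder()
score, logits = model(torch.randn(1, 36, 2048),
                      torch.randint(0, 30522, (1, 64)))

In the actual model the encoder weights come from pretrained BERT, and training uses visually grounded objectives over the concatenated image-dialog sequence; the two heads above loosely mirror how one architecture can serve both the ranking and generation settings mentioned in the abstract.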