Paper Title

Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations

Authors

Meng-Jiun Chiou, Roger Zimmermann, Jiashi Feng

Abstract

Visual relationship detection aims to reason over relationships among salient objects in images, which has drawn increasing attention over the past few years. Inspired by human reasoning mechanisms, it is believed that external visual commonsense knowledge is beneficial for reasoning visual relationships of objects in images, which is however rarely considered in existing methods. In this paper, we propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT), which performs relational reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training with multimodal representations. RVL-BERT also uses an effective spatial module and a novel mask attention module to explicitly capture spatial information among the objects. Moreover, our model decouples object detection from visual relationship recognition by taking in object names directly, enabling it to be used on top of any object detection system. We show through quantitative and qualitative experiments that, with the transferred knowledge and novel modules, RVL-BERT achieves competitive results on two challenging visual relationship detection datasets. The source code is available at https://github.com/coldmanck/RVL-BERT.
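
The abstract describes the architecture only at a high level. As a rough illustration of the decoupling idea (the relationship classifier consumes pre-detected object names and region features rather than running its own detector), below is a minimal, hypothetical PyTorch sketch. The class name, layer sizes, and the simple token-concatenation scheme are assumptions made for illustration only and do not reflect the authors' implementation, which is available at the GitHub link above.

```python
# Hypothetical sketch, NOT the authors' code: a BERT-style multimodal encoder that
# takes subject/object name tokens plus their region features and predicts a predicate.
import torch
import torch.nn as nn

class ToyVisualLinguisticRelationClassifier(nn.Module):
    def __init__(self, vocab_size=30522, hidden_dim=256, visual_dim=2048, num_predicates=70):
        super().__init__()
        # Linguistic stream: embeddings for the subject/object name tokens.
        self.word_embed = nn.Embedding(vocab_size, hidden_dim)
        # Visual stream: project region features from any off-the-shelf detector.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Shared Transformer encoder over the joint visual-linguistic token sequence.
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Predicate classifier on the pooled sequence representation.
        self.classifier = nn.Linear(hidden_dim, num_predicates)

    def forward(self, subj_tokens, obj_tokens, subj_feat, obj_feat):
        # subj_tokens, obj_tokens: (B, T) token ids of the object names
        # subj_feat, obj_feat:     (B, visual_dim) region features from a detector
        text = torch.cat([self.word_embed(subj_tokens), self.word_embed(obj_tokens)], dim=1)
        vis = torch.stack([self.visual_proj(subj_feat), self.visual_proj(obj_feat)], dim=1)
        seq = torch.cat([text, vis], dim=1)      # multimodal token sequence
        enc = self.encoder(seq)                  # joint visual-linguistic encoding
        return self.classifier(enc.mean(dim=1))  # predicate logits

# Usage with dummy inputs
model = ToyVisualLinguisticRelationClassifier()
logits = model(
    torch.randint(0, 30522, (2, 3)),  # subject name tokens
    torch.randint(0, 30522, (2, 3)),  # object name tokens
    torch.randn(2, 2048),             # subject region feature
    torch.randn(2, 2048),             # object region feature
)
print(logits.shape)  # torch.Size([2, 70])
```

Because the classifier only sees object names and region features, it can sit on top of any object detection system, which is the decoupling property highlighted in the abstract; the paper's spatial module, mask attention module, and pre-trained visual-linguistic knowledge are omitted here.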
