Paper Title
Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion
Paper Authors
Paper Abstract
Multimodal Knowledge Graphs (MKGs), which organize visual-text factual knowledge, have recently been successfully applied to tasks such as information retrieval, question answering, and recommendation systems. Since most MKGs are far from complete, extensive knowledge graph completion studies have been proposed, focusing on multimodal entity extraction, relation extraction, and link prediction. However, different tasks and modalities require changes to the model architecture, and not all images/objects are relevant to the text input, which hinders applicability to diverse real-world scenarios. In this paper, we propose a hybrid transformer with multi-level fusion to address these issues. Specifically, we leverage a hybrid transformer architecture with unified input-output for diverse multimodal knowledge graph completion tasks. Moreover, we propose multi-level fusion, which integrates visual and textual representations via a coarse-grained prefix-guided interaction module and a fine-grained correlation-aware fusion module. We conduct extensive experiments to validate that our MKGformer can obtain SOTA performance on four datasets covering multimodal link prediction, multimodal RE, and multimodal NER. Code is available at https://github.com/zjunlp/MKGformer.
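
The abstract names two fusion levels: a coarse-grained prefix-guided interaction and a fine-grained correlation-aware fusion. Below is a minimal conceptual sketch of those two ideas, not the authors' released MKGformer code; the module names, dimensions, and the exact gating/aggregation choices are illustrative assumptions using standard PyTorch only.

```python
# Conceptual sketch only (assumptions, not the official MKGformer implementation).
import torch
import torch.nn as nn


class PrefixGuidedInteraction(nn.Module):
    """Coarse-grained fusion: visual tokens are prepended as extra key/value
    'prefixes' so text self-attention can softly attend to the image."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Keys/values = [visual prefix; text tokens], queries = text tokens.
        kv = torch.cat([visual, text], dim=1)
        out, _ = self.attn(query=text, key=kv, value=kv)
        return out


class CorrelationAwareFusion(nn.Module):
    """Fine-grained fusion: estimate token-level text-image correlations and use
    them to decide how much visual information each text token absorbs."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.proj = nn.Linear(dim, dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Correlation between every text token and every visual patch.
        corr = torch.softmax(text @ visual.transpose(1, 2) * self.scale, dim=-1)
        # Aggregate visual patches per text token, then add as a residual.
        visual_per_token = corr @ visual            # (B, T_text, dim)
        return text + self.proj(visual_per_token)


if __name__ == "__main__":
    B, T_text, T_vis, dim = 2, 16, 50, 768          # toy sizes (assumptions)
    text = torch.randn(B, T_text, dim)              # e.g. BERT token states
    visual = torch.randn(B, T_vis, dim)             # e.g. projected ViT patches
    coarse = PrefixGuidedInteraction(dim)(text, visual)
    fused = CorrelationAwareFusion(dim)(coarse, visual)
    print(fused.shape)                              # torch.Size([2, 16, 768])
```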