Paper Title
Named Entity and Relation Extraction with Multi-Modal Retrieval
Paper Authors
Paper Abstract
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE. Most existing efforts have largely focused on directly extracting potentially useful information from images (such as pixel-level features, identified objects, and associated captions). However, such extraction processes may not be knowledge-aware, resulting in information that may not be highly relevant. In this paper, we propose a novel Multi-modal Retrieval-based framework (MoRe). MoRe contains a text retrieval module and an image-based retrieval module, which retrieve knowledge related to the input text and image, respectively, from a knowledge corpus. Next, the retrieval results are fed to the textual and visual models, respectively, for prediction. Finally, a Mixture of Experts (MoE) module combines the predictions from the two models to make the final decision. Our experiments show that both our textual model and visual model can achieve state-of-the-art performance on four multi-modal NER datasets and one multi-modal RE dataset. With MoE, the model performance can be further improved, and our analysis demonstrates the benefits of integrating both textual and visual cues for such tasks.
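To make the MoE combination step concrete, here is a minimal PyTorch sketch of how a learned gate might mix the per-token label distributions produced by a text-retrieval expert and an image-retrieval expert. The names (MoEGate, p_text, p_image) and the choice to condition the gate on shared token representations are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MoEGate(nn.Module):
    """Mixes per-token label distributions from two experts
    (text-retrieval model and image-retrieval model) via a learned gate.

    Assumption: the gate conditions on a shared per-token representation;
    the paper's actual gating inputs may differ.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Produces a 2-way softmax weight per token: one weight per expert.
        self.gate = nn.Linear(hidden_dim, 2)

    def forward(self, token_repr, p_text, p_image):
        # token_repr: (batch, seq, hidden)      shared token representations
        # p_text:     (batch, seq, num_labels)  textual expert's distribution
        # p_image:    (batch, seq, num_labels)  visual expert's distribution
        weights = torch.softmax(self.gate(token_repr), dim=-1)  # (batch, seq, 2)
        w_text = weights[..., 0:1]   # broadcast over the label dimension
        w_image = weights[..., 1:2]
        return w_text * p_text + w_image * p_image  # mixed final distribution


if __name__ == "__main__":
    # Toy usage with random tensors standing in for real model outputs.
    batch, seq, hidden, num_labels = 2, 16, 768, 9
    gate = MoEGate(hidden)
    token_repr = torch.randn(batch, seq, hidden)
    p_text = torch.softmax(torch.randn(batch, seq, num_labels), dim=-1)
    p_image = torch.softmax(torch.randn(batch, seq, num_labels), dim=-1)
    mixed = gate(token_repr, p_text, p_image)
    print(mixed.shape)  # torch.Size([2, 16, 9])
```

A token-level gate like this lets the model lean on the textual expert where retrieved text is informative and on the visual expert where the image-based retrieval is more relevant, rather than fixing a single global mixing weight.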