Paper Title
Flat Multi-modal Interaction Transformer for Named Entity Recognition
Authors
Abstract
Multi-modal named entity recognition (MNER) aims to identify entity spans and recognize their categories in social media posts with the aid of images. However, dominant MNER approaches usually model the interaction between modalities either by alternating self-attention and cross-attention or by relying heavily on a gating mechanism, which results in imprecise and biased correspondence between the fine-grained semantic units of text and image. To address this issue, we propose a Flat Multi-modal Interaction Transformer (FMIT) for MNER. Specifically, we first use noun phrases in the sentence and general domain words to obtain visual cues. We then transform the fine-grained semantic representations of vision and text into a unified lattice structure and design a novel relative position encoding to match the different modalities in the Transformer. In addition, we propose to leverage entity boundary detection as an auxiliary task to alleviate visual bias. Experiments show that our method achieves new state-of-the-art performance on two benchmark datasets.
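The auxiliary boundary-detection task mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the tagging scheme or how the two objectives are combined, so the BIO scheme, the helper names `to_boundary_tags` and `joint_loss`, and the weight value below are assumptions made only for illustration, not the authors' implementation.

```python
from typing import List


def to_boundary_tags(ner_tags: List[str]) -> List[str]:
    """Drop the entity type from BIO tags, keeping only span boundaries.

    Assumes BIO-style tags such as "B-PER" or "I-LOC" (an assumption;
    the abstract does not name the tagging scheme).
    """
    boundary = []
    for tag in ner_tags:
        if tag == "O":
            boundary.append("O")
        else:
            # "B-PER" -> "B", "I-LOC" -> "I"
            prefix, _, _entity_type = tag.partition("-")
            boundary.append(prefix)
    return boundary


def joint_loss(ner_loss: float, boundary_loss: float, weight: float = 0.5) -> float:
    """Combine the main and auxiliary losses as a weighted sum.

    The weighted-sum form and the default weight are hypothetical.
    """
    return ner_loss + weight * boundary_loss


tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(to_boundary_tags(tags))  # ['B', 'I', 'O', 'B']
```

Because the boundary labels carry no entity type, a model trained on them has to locate spans from text alone, which is one plausible reading of how such a task could reduce reliance on the image.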