Paper Title
Rethinking the Value of Gazetteer in Chinese Named Entity Recognition
Paper Authors
Paper Abstract
Gazetteers are widely used in Chinese named entity recognition (NER) to enhance span boundary detection and type classification. However, the NLP community still lacks a systematic analysis of gazetteer-enhanced NER models that would clarify how well gazetteers generalize and how effective they are. In this paper, we first re-examine the effectiveness of several common practices in gazetteer-enhanced NER models, and then carry out a series of detailed analyses to evaluate the relationship between model performance and gazetteer characteristics, which can guide the construction of a more suitable gazetteer. The findings of this paper are as follows: (1) the gazetteer helps in most of the situations that traditional NER models find difficult to learn from the datasets; (2) model performance benefits greatly from high-quality pre-trained lexeme embeddings; (3) a good gazetteer should cover more entities that can be matched in both the training set and the test set.
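As a rough illustration of finding (3), the sketch below measures what fraction of the distinct entity strings in a dataset split exactly match a gazetteer entry. The function name `gazetteer_coverage`, the exact-match criterion, and the toy data are all assumptions made for illustration; the abstract does not specify how the paper itself measures coverage.

```python
def gazetteer_coverage(gazetteer, entities):
    """Fraction of distinct entity strings that also appear in the gazetteer.

    Hypothetical helper: exact string match is an assumed criterion,
    not necessarily the one used in the paper.
    """
    distinct = set(entities)
    if not distinct:
        return 0.0
    return len(distinct & set(gazetteer)) / len(distinct)


# Toy data, invented purely for illustration.
gazetteer = {"北京", "上海", "清华大学", "长江"}
train_entities = ["北京", "清华大学", "张伟", "北京"]
test_entities = ["上海", "李娜"]

train_cov = gazetteer_coverage(gazetteer, train_entities)
test_cov = gazetteer_coverage(gazetteer, test_entities)
print(f"train coverage: {train_cov:.2f}, test coverage: {test_cov:.2f}")
```

Under this reading, a gazetteer that scores well on both splits is closer to the "good gazetteer" described in finding (3) than one that matches entities in only one split.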