Paper Title


A Rigorous Study on Named Entity Recognition: Can Fine-tuning Pretrained Model Lead to the Promised Land?

Authors

Hongyu Lin, Yaojie Lu, Jialong Tang, Xianpei Han, Le Sun, Zhicheng Wei, Nicholas Jing Yuan

Abstract


Fine-tuning pretrained models has achieved promising performance on standard NER benchmarks. Generally, these benchmarks are blessed with strong name regularity, high mention coverage, and sufficient context diversity. Unfortunately, when scaling NER to open situations, these advantages may no longer exist, which raises a critical question: can previously credible approaches still work well when facing these challenges? As there is no currently available dataset for investigating this problem, this paper proposes to conduct randomization tests on standard benchmarks. Specifically, we erase name regularity, mention coverage, and context diversity from the benchmarks, respectively, in order to explore their impact on the generalization ability of models. To further verify our conclusions, we also construct a new open NER dataset that focuses on entity types with weaker name regularity and lower mention coverage. From both the randomization tests and empirical experiments, we conclude that 1) name regularity is critical for models to generalize to unseen mentions; 2) high mention coverage may undermine model generalization ability; and 3) context patterns may not require enormous data to capture when pretrained encoders are used.
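The core operation behind the paper's randomization test, erasing name regularity so that surface-form cues no longer help the model, can be sketched as follows. This is an illustrative assumption about the procedure, not the authors' actual implementation: the function name, the BIO-tagged toy sentence, and the choice of random lowercase replacements are all hypothetical.

```python
import random
import string

def erase_name_regularity(tokens, labels, seed=0):
    """Replace every entity-mention token with a random string of the
    same length, so the model can no longer rely on name regularity
    and must generalize from context alone. Assumes BIO-style labels
    where "O" marks non-entity tokens."""
    rng = random.Random(seed)
    out = []
    for tok, lab in zip(tokens, labels):
        if lab != "O":  # token belongs to an entity mention
            tok = "".join(rng.choice(string.ascii_lowercase)
                          for _ in range(len(tok)))
        out.append(tok)
    return out

# Toy example: the PER mention "John Smith" and the ORG mention
# "Google" become random strings; the context words are untouched.
tokens = ["John", "Smith", "works", "at", "Google"]
labels = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(erase_name_regularity(tokens, labels))
```

Analogous perturbations (shuffling mentions to break coverage, or subsampling contexts to reduce diversity) would follow the same pattern of corrupting one property while holding the others fixed.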
