Paper title
Cost-Quality Adaptive Active Learning for Chinese Clinical Named Entity Recognition
Paper authors
Paper abstract
Clinical Named Entity Recognition (CNER) aims to automatically identify clinical terminology in Electronic Health Records (EHRs), a fundamental and crucial step for clinical research. Training a high-performance CNER model usually requires a large number of EHRs with high-quality labels. However, labeling EHRs, especially Chinese EHRs, is time-consuming and expensive. One effective solution is active learning, in which the model asks labelers to annotate the data it is uncertain about. Conventional active learning assumes a single labeler who always returns noise-free answers to the queried labels. In real settings, however, multiple labelers provide annotations of varying quality at varying costs, and labelers with low overall annotation quality can still assign correct labels to some specific instances. In this paper, we propose a Cost-Quality Adaptive Active Learning (CQAAL) approach for CNER in Chinese EHRs, which maintains a balance between annotation quality, labeling cost, and the informativeness of the selected instances. Specifically, CQAAL adaptively selects cost-effective instance-labeler pairs to achieve better annotation quality at lower cost. Computational results on the CCKS-2017 Task 2 benchmark dataset demonstrate the superiority and effectiveness of the proposed CQAAL.
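The abstract does not give the selection criterion itself, so the following is only a minimal sketch of what choosing an instance-labeler pair could look like. It assumes a least-confidence uncertainty measure, a per-pair quality estimate, and a score of the form informativeness × quality / cost. The names `Labeler`, `least_confidence`, and `select_pair`, the scoring formula, and all numbers are hypothetical and are not taken from the paper. For brevity, each instance is reduced to a single probability vector rather than a full tag sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Labeler:
    name: str
    cost: float  # price per annotated instance (hypothetical unit)


def least_confidence(probs):
    """Uncertainty of a prediction: 1 minus the highest class probability."""
    return 1.0 - max(probs)


def select_pair(pool, labelers, quality):
    """Return ((instance_id, labeler_name), score) for the best-scoring pair.

    Assumed trade-off (not the paper's formula):
        score = informativeness * estimated_quality / cost
    """
    best_pair, best_score = None, float("-inf")
    for inst_id, probs in pool.items():
        info = least_confidence(probs)
        for lab in labelers:
            score = info * quality[(lab.name, inst_id)] / lab.cost
            if score > best_score:
                best_pair, best_score = (inst_id, lab.name), score
    return best_pair, best_score


# Toy data: model probabilities per unlabeled instance (made up).
pool = {"s1": [0.9, 0.05, 0.05], "s2": [0.4, 0.35, 0.25]}
labelers = [Labeler("expert", 5.0), Labeler("novice", 1.0)]
# Estimated probability that a labeler annotates an instance correctly (made up).
quality = {
    ("expert", "s1"): 0.95, ("expert", "s2"): 0.95,
    ("novice", "s1"): 0.90, ("novice", "s2"): 0.60,
}

pair, score = select_pair(pool, labelers, quality)
print(pair, round(score, 3))
```

With these toy numbers the uncertain instance `s2` is routed to the cheaper labeler: the lower cost outweighs the lower estimated quality under the assumed score. A different weighting of the three factors would change that outcome.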