Title

Extensive Error Analysis and a Learning-Based Evaluation of Medical Entity Recognition Systems to Approximate User Experience

Authors

Nejadgholi, Isar; Fraser, Kathleen C.; De Bruijn, Berry

Abstract

When comparing entities extracted by a medical entity recognition system with gold standard annotations over a test set, two types of mismatch can occur: label mismatch and span mismatch. Here we focus on span mismatch and show that, because span annotation is subjective, its severity can range from a serious error to a fully acceptable entity extraction. For a domain-specific BERT-based NER system, we show that 25% of the errors have the same label as a gold standard entity and a span that overlaps it. We collected expert judgements, which show that more than 90% of these mismatches are accepted or partially accepted by the user. Using the training set of the NER system, we built a fast and lightweight entity classifier that approximates the user experience of such mismatches by accepting or rejecting them. The decisions made by this classifier are used to calculate a learning-based F-score, which is shown to be a better approximation of a forgiving user's experience than the relaxed F-score. We demonstrate the results of applying the proposed evaluation metric to a variety of deep learning medical entity recognition models trained on two datasets.
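To make the contrast between the three scores concrete, the following is a minimal sketch and not the authors' implementation. It assumes entities are `(start, end, label)` tuples with an exclusive end offset, that the gold list has no duplicates, and that the relaxed score credits every same-label overlapping span. The `accept` callback is a hypothetical stand-in for the entity classifier described in the abstract; the toy rule used here (boundaries differ by at most one token in total) is an illustration only.

```python
from typing import Callable, Dict, List, Tuple

# (start, end_exclusive, label); this representation is an assumption of the sketch.
Entity = Tuple[int, int, str]


def overlaps(a: Entity, b: Entity) -> bool:
    """True if the two spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def f_score(tp: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision (tp / n_pred) and recall (tp / n_gold)."""
    if tp == 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def evaluate(
    gold: List[Entity],
    pred: List[Entity],
    accept: Callable[[Entity, Entity], bool],
) -> Dict[str, float]:
    """Return exact, relaxed and learning-based F-scores for one document."""
    gold_left = set(gold)
    pred_left = []
    exact = 0
    # Pass 1: exact matches (same span and same label).
    for p in pred:
        if p in gold_left:
            gold_left.remove(p)
            exact += 1
        else:
            pred_left.append(p)
    relaxed = learned = exact
    # Pass 2: span mismatches, i.e. same label with an overlapping span.
    for p in pred_left:
        partner = next(
            (g for g in sorted(gold_left) if g[2] == p[2] and overlaps(g, p)),
            None,
        )
        if partner is None:
            continue
        gold_left.remove(partner)  # each gold entity is credited at most once
        relaxed += 1               # the relaxed score accepts every such mismatch
        if accept(p, partner):     # the learning-based score asks the classifier
            learned += 1
    n_pred, n_gold = len(pred), len(gold)
    return {
        "exact": f_score(exact, n_pred, n_gold),
        "relaxed": f_score(relaxed, n_pred, n_gold),
        "learning_based": f_score(learned, n_pred, n_gold),
    }


def toy_accept(p: Entity, g: Entity) -> bool:
    """Hypothetical placeholder for a trained classifier's accept/reject decision."""
    return abs(p[0] - g[0]) + abs(p[1] - g[1]) <= 1


gold = [(0, 2, "PROBLEM"), (5, 8, "TREATMENT"), (10, 12, "TEST")]
pred = [(0, 2, "PROBLEM"), (5, 7, "TREATMENT"), (10, 15, "TEST"), (20, 21, "TEST")]
scores = evaluate(gold, pred, toy_accept)
print(scores)
```

In this toy example the learning-based score falls between the exact and relaxed scores, because one of the two span mismatches is accepted and the other is rejected.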
