Paper Title
Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts
Paper Authors
Paper Abstract
Despite the advances in digital healthcare systems offering curated structured knowledge, much of the critical information still lies in large volumes of unlabeled and unstructured clinical texts. These texts, which often contain protected health information (PHI), are exposed to information extraction tools for downstream applications, risking patient identification. Existing works in de-identification rely on using large-scale annotated corpora in English, which often are not suitable in real-world multilingual settings. Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings. In this work, we empirically show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke domain. We annotate a gold evaluation dataset to assess few-shot setting performance where we only use a few hundred labeled examples for training. Our model improves the zero-shot F1-score from 73.7% to 91.2% on the gold evaluation set when adapting Multilingual BERT (mBERT) (Devlin et al., 2019) from the MEDDOCAN (Marimon et al., 2019) corpus with our few-shot cross-lingual target corpus. When generalized to an out-of-sample test set, the best model achieves a human-evaluation F1-score of 97.2%.
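The abstract describes a two-stage recipe: fine-tune multilingual BERT for token-level NER on a large source PHI corpus (MEDDOCAN), then continue training the same model on a few hundred labeled code-mixed target sentences. The following is a minimal sketch of that recipe using the HuggingFace transformers/datasets APIs, not the authors' implementation; the coarse-grained label set, dataset column names ("tokens", string BIO "ner_tags"), output paths, and hyperparameters are illustrative assumptions.

# Minimal sketch (assumptions noted in the lead-in) of source fine-tuning
# followed by few-shot target adaptation of mBERT for PHI de-identification NER.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          Trainer, TrainingArguments)

# Illustrative coarse-grained PHI labels in BIO format (not the paper's exact tag set).
LABELS = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE", "B-LOCATION", "I-LOCATION"]
label2id = {l: i for i, l in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(LABELS),
    id2label={i: l for l, i in label2id.items()},
    label2id=label2id,
)

def encode(examples):
    # Align word-level BIO tags with wordpieces; only the first subword of a
    # word keeps its label, everything else is masked out of the loss (-100).
    enc = tokenizer(examples["tokens"], is_split_into_words=True,
                    truncation=True, padding="max_length", max_length=128)
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        prev, row = None, []
        for wid in enc.word_ids(batch_index=i):
            if wid is None or wid == prev:
                row.append(-100)
            else:
                row.append(label2id[tags[wid]])
            prev = wid
        all_labels.append(row)
    enc["labels"] = all_labels
    return enc

def finetune(dataset, output_dir, epochs):
    # One fine-tuning stage; the shared `model` object carries weights from
    # stage 1 into stage 2, which is what the cross-lingual transfer relies on.
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=epochs,
                             per_device_train_batch_size=16, learning_rate=5e-5)
    Trainer(model=model, args=args,
            train_dataset=dataset.map(encode, batched=True)).train()

# Stage 1: source-language PHI corpus (evaluating after this step gives the zero-shot baseline).
# finetune(source_dataset, "out/source", epochs=3)
# Stage 2: a few hundred labeled code-mixed target sentences (few-shot adaptation).
# finetune(target_fewshot_dataset, "out/target", epochs=10)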