Paper title
GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines
Paper authors
Paper abstract
The lack of publicly accessible text corpora is a major obstacle for progress in natural language processing. For medical applications, unfortunately, all language communities other than English are low-resourced. In this work, we present GGPONC (German Guideline Program in Oncology NLP Corpus), a freely distributable German language corpus based on clinical practice guidelines for oncology. This corpus is one of the largest ever built from German medical documents. Unlike clinical documents, clinical guidelines do not contain any patient-related information and can therefore be used without data protection restrictions. Moreover, GGPONC is the first corpus for the German language covering diverse conditions in a large medical subfield and provides a variety of metadata, such as literature references and evidence levels. By applying and evaluating existing medical information extraction pipelines for German text, we are able to draw comparisons for the use of medical language to other corpora, medical and non-medical ones.