Comparative Analysis of Text Classification Approaches in Electronic Health Records

Paper title

Comparative Analysis of Text Classification Approaches in Electronic Health Records

Authors

Aurelie Mascio, Zeljko Kraljevic, Daniel Bean, Richard Dobson, Robert Stewart, Rebecca Bendayan, Angus Roberts

Abstract

Text classification tasks that aim at harvesting and/or organizing information from electronic health records are pivotal to supporting clinical and translational research. However, these tasks present specific challenges compared to other classification tasks, notably due to the particular nature of the medical lexicon and the language used in clinical records. Recent advances in embedding methods have shown promising results for several clinical tasks, yet there is no exhaustive comparison of such approaches with other commonly used word representations and classification models. In this work, we analyse the impact of various word representations, text pre-processing and classification algorithms on the performance of four different text classification tasks. The results show that traditional approaches, when tailored to the specific language and structure of the text inherent to the classification task, can achieve or exceed the performance of more recent ones based on contextual embeddings such as BERT.
