Paper Title

SODA: A Natural Language Processing Package to Extract Social Determinants of Health for Cancer Studies

Paper Authors

Zehao Yu, Xi Yang, Chong Dang, Prakash Adekkanattu, Braja Gopal Patra, Yifan Peng, Jyotishman Pathak, Debbie L. Wilson, Ching-Yuan Chang, Wei-Hsuan Lo-Ciganic, Thomas J. George, William R. Hogan, Yi Guo, Jiang Bian, Yonghui Wu

Paper Abstract

Objective: We aim to develop an open-source natural language processing (NLP) package, SODA (i.e., SOcial DeterminAnts), with pre-trained transformer models to extract social determinants of health (SDoH) for cancer patients, examine the generalizability of SODA to a new disease domain (i.e., opioid use), and evaluate the extraction rates of SDoH in cancer populations.

Methods: We identified SDoH categories and attributes and developed an SDoH corpus using clinical notes from a general cancer cohort. We compared four transformer-based NLP models for extracting SDoH, examined the generalizability of the NLP models to a cohort of patients prescribed opioids, and explored customization strategies to improve performance. We applied the best NLP model to extract 19 categories of SDoH from the breast (n=7,971), lung (n=11,804), and colorectal cancer (n=6,240) cohorts.

Results and Conclusion: We developed a corpus of notes from 629 cancer patients annotated with 13,193 SDoH concepts/attributes across 19 categories of SDoH. The Bidirectional Encoder Representations from Transformers (BERT) model achieved the best strict/lenient F1 scores of 0.9216 and 0.9441 for SDoH concept extraction, and 0.9617 and 0.9626 for linking attributes to SDoH concepts. Fine-tuning the NLP models with new annotations from patients prescribed opioids improved the strict/lenient F1 scores from 0.8172/0.8502 to 0.8312/0.8679. The extraction rates among the 19 categories of SDoH varied greatly: 10 SDoH could be extracted from >70% of cancer patients, while 9 SDoH had a low extraction rate (<70% of cancer patients). The SODA package with pre-trained transformer models is publicly available at https://github.com/uf-hobiinformatics-lab/SDoH_SODA.
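The abstract frames SDoH concept extraction as transformer-based named entity recognition over clinical notes. The sketch below only illustrates how such a fine-tuned token-classification model is typically invoked with the Hugging Face transformers library; the checkpoint path and the example note are placeholders, and this is not the SODA package's actual API (see the GitHub repository above for the released models and usage).

# Minimal sketch of SDoH concept extraction with a pre-trained transformer
# NER model, using the Hugging Face `transformers` pipeline.
from transformers import pipeline

# Hypothetical path to a fine-tuned SDoH token-classification checkpoint;
# substitute the model released in the SODA repository (assumption, not a
# real identifier).
MODEL_PATH = "path/to/sdoh-bert-checkpoint"

ner = pipeline(
    "token-classification",
    model=MODEL_PATH,
    aggregation_strategy="simple",  # merge word pieces into full entity spans
)

note = (
    "Patient is a retired teacher, lives alone, smokes 1 pack per day, "
    "and reports occasional alcohol use."
)

# Each prediction carries the entity label, the matched text span, and its
# character offsets, which downstream code can map to SDoH categories.
for entity in ner(note):
    print(entity["entity_group"], entity["word"], entity["start"], entity["end"])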
