Paper title
PMC-Patients: A Large-scale Dataset of Patient Summaries and Relations for Benchmarking Retrieval-based Clinical Decision Support Systems
Paper authors
Paper abstract
Objective: Retrieval-based Clinical Decision Support (ReCDS) can aid clinical workflow by providing relevant literature and similar patients for a given patient. However, the development of ReCDS systems has been severely obstructed by the lack of diverse patient collections and publicly available large-scale patient-level annotation datasets. In this paper, we aim to define and benchmark two ReCDS tasks: Patient-to-Article Retrieval (ReCDS-PAR) and Patient-to-Patient Retrieval (ReCDS-PPR) using a novel dataset called PMC-Patients. Methods: We extract patient summaries from PubMed Central articles using simple heuristics and utilize the PubMed citation graph to define patient-article relevance and patient-patient similarity. We also implement and evaluate several ReCDS systems on the PMC-Patients benchmarks, including sparse retrievers, dense retrievers, and nearest neighbor retrievers. We conduct several case studies to show the clinical utility of PMC-Patients. Results: PMC-Patients contains 167k patient summaries with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations, which is the largest-scale resource for ReCDS and also one of the largest patient collections. Human evaluation and analysis show that PMC-Patients is a diverse dataset with high-quality annotations. The evaluation of various ReCDS systems shows that the PMC-Patients benchmark is challenging and calls for further research. Conclusion: We present PMC-Patients, a large-scale, diverse, and publicly available patient summary dataset with the largest-scale patient-level relation annotations. Based on PMC-Patients, we formally define two benchmark tasks for ReCDS systems and evaluate various existing retrieval methods. PMC-Patients can largely facilitate methodology research on ReCDS systems and shows real-world clinical utility.
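The abstract says the PubMed citation graph is used to define patient-article relevance and patient-patient similarity, but it does not give the exact rules. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that citation links are treated as undirected, that a patient's own source article counts as relevant, and that patients from the same or from citation-linked articles count as similar. The function name `build_relations` and the input shapes are hypothetical.

```python
from collections import defaultdict


def build_relations(patient_source, citations):
    """Derive relation annotations from a citation graph (illustrative only).

    patient_source: dict mapping patient id -> id of the article the
        patient summary was extracted from.
    citations: iterable of (citing_article, cited_article) pairs.
    Returns (relevant_articles, similar_patients), both dicts of sets.
    """
    # Assumption: a citation link counts in both directions.
    neighbors = defaultdict(set)
    for citing, cited in citations:
        neighbors[citing].add(cited)
        neighbors[cited].add(citing)

    # Group patients by the article they came from.
    patients_in = defaultdict(set)
    for pid, article in patient_source.items():
        patients_in[article].add(pid)

    relevant_articles = {}
    similar_patients = {}
    for pid, article in patient_source.items():
        linked = neighbors.get(article, set())
        # Assumption: the source article itself is also relevant.
        relevant_articles[pid] = {article} | linked
        # Assumption: patients sharing an article, or sitting in
        # citation-linked articles, are treated as similar.
        similar = set(patients_in[article])
        for other in linked:
            similar |= patients_in.get(other, set())
        similar.discard(pid)  # a patient is not its own match
        similar_patients[pid] = similar
    return relevant_articles, similar_patients
```

With annotations in this shape, a retriever for either task can be scored by checking how many of its top-ranked results for a query patient fall inside `relevant_articles[pid]` or `similar_patients[pid]`.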