Paper title
DDXPlus: A New Dataset For Automatic Medical Diagnosis
Paper authors
Abstract
There has been rapidly growing interest in Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems in the machine learning research literature, with the aim of assisting doctors in telemedicine services. These systems are designed to interact with patients, collect evidence about their symptoms and relevant antecedents, and possibly make predictions about the underlying diseases. Doctors would review the interactions, including the evidence and the predictions, and, if necessary, collect additional information from patients before deciding on next steps. Despite recent progress in this area, an important piece of doctors' interactions with patients is missing from the design of these systems, namely the differential diagnosis. Its absence is largely due to the lack of datasets that include such information for models to train on. In this work, we present a large-scale synthetic dataset of roughly 1.3 million patients that includes a differential diagnosis, along with the ground truth pathology, symptoms and antecedents for each patient. Unlike existing datasets, which contain only binary symptoms and antecedents, this dataset also contains categorical and multi-choice symptoms and antecedents useful for efficient data collection. Moreover, some symptoms are organized in a hierarchy, making it possible to design systems able to interact with patients in a logical way. As a proof-of-concept, we extend two existing AD and ASD systems to incorporate the differential diagnosis, and provide empirical evidence that using differentials as training signals is essential for the efficiency of such systems or for helping doctors better understand the reasoning of those systems.
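As a rough illustration of the record structure the abstract describes, the sketch below models one synthetic patient with a ground truth pathology, a mix of binary, categorical and multi-choice evidences, and a ranked differential diagnosis. Every name here (`Patient`, `EvidenceValue`, `top_k_differential`, `ground_truth_in_top_k`, the field names and the placeholder values) is a hypothetical choice made for this example; it is not the actual DDXPlus file schema, and the probabilities are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

# Hypothetical value types for the three evidence kinds named in the abstract:
# binary (bool), categorical (a single str), multi-choice (a list of str).
EvidenceValue = Union[bool, str, List[str]]


@dataclass
class Patient:
    """Illustrative patient record; NOT the real DDXPlus schema."""
    pathology: str                                   # ground truth pathology
    evidences: Dict[str, EvidenceValue]              # symptoms and antecedents
    differential: List[Tuple[str, float]] = field(default_factory=list)


def top_k_differential(patient: Patient, k: int) -> List[str]:
    """Return the k pathologies with the highest differential score."""
    ranked = sorted(patient.differential, key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def ground_truth_in_top_k(patient: Patient, k: int) -> bool:
    """Check whether the ground truth pathology is among the top-k differential."""
    return patient.pathology in top_k_differential(patient, k)


# Placeholder data, invented purely for illustration.
example = Patient(
    pathology="pathology_A",
    evidences={
        "has_fever": True,                      # binary
        "pain_character": "sharp",              # categorical
        "pain_location": ["chest", "back"],     # multi-choice
    },
    differential=[("pathology_B", 0.2), ("pathology_A", 0.7), ("pathology_C", 0.1)],
)

print(top_k_differential(example, 2))
print(ground_truth_in_top_k(example, 1))
```

Keeping the differential as a list of (pathology, score) pairs alongside the single ground truth label reflects the abstract's point that a model can be trained on the whole ranked list and not only on one target disease.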