Was There COVID-19 Back in 2012? Challenge for AI in Diagnosis with Similar Indications

Paper Title

Was there COVID-19 back in 2012? Challenge for AI in Diagnosis with Similar Indications

Authors

Banerjee, Imon; Sinha, Priyanshu; Purkayastha, Saptarshi; Mashhaditafreshi, Nazanin; Tariq, Amara; Jeong, Jiwoong; Trivedi, Hari; Gichoya, Judy W.

Abstract

Purpose: Since the recent COVID-19 outbreak, there has been an avalanche of research papers applying deep-learning-based image processing to chest radiographs (CXR) to detect the disease. We test the performance of the two top models for CXR COVID-19 diagnosis on external datasets to assess model generalizability. Methods: In this paper, we present our argument regarding the efficiency and applicability of existing deep learning models for COVID-19 diagnosis. We provide results from two popular models, COVID-Net and CoroNet, evaluated on three publicly available datasets and an additional institutional dataset collected from Emory Hospital between January and May 2020, containing patients tested for COVID-19 infection using RT-PCR. Results: COVID-Net has a large false positive rate (FPR) on both the CheXpert (55.3%) and MIMIC-CXR (23.4%) datasets. On the Emory dataset, COVID-Net has 61.4% sensitivity, an F1-score of 0.54, and a precision of 0.49. The FPR of the CoroNet model is significantly lower than that of COVID-Net across all the datasets: Emory (9.1%), CheXpert (1.3%), ChestX-ray14 (0.02%), MIMIC-CXR (0.06%). Conclusion: The models reported good to excellent performance on their internal datasets; however, we observed in our testing that their performance worsened dramatically on external data. This likely has several causes, including overfitting due to a lack of appropriate control patients and ground-truth labels. The fourth, institutional dataset was labeled using RT-PCR, which could be positive without radiographic findings and vice versa. Therefore, a fusion model of both clinical and radiographic data may have better performance and generalization.
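
The metrics quoted in the abstract (sensitivity, FPR, precision, F1-score) all derive from a binary confusion matrix. The sketch below shows the standard definitions only; it is not the authors' evaluation code. The `Confusion` class, the function names, and the example counts are hypothetical and chosen for illustration. The last lines apply the F1 formula to the precision (0.49) and sensitivity (61.4%) reported for COVID-Net on the Emory dataset, which gives roughly 0.545, in line with the reported F1-score of 0.54 given that the inputs are themselves rounded.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Binary confusion-matrix counts (hypothetical helper, not from the paper)."""
    tp: int  # COVID-19 cases predicted positive
    fp: int  # non-COVID cases predicted positive
    tn: int  # non-COVID cases predicted negative
    fn: int  # COVID-19 cases predicted negative


def sensitivity(c: Confusion) -> float:
    """Fraction of true COVID-19 cases the model detects (recall)."""
    return c.tp / (c.tp + c.fn)


def false_positive_rate(c: Confusion) -> float:
    """Fraction of non-COVID cases wrongly flagged as COVID-19."""
    return c.fp / (c.fp + c.tn)


def precision(c: Confusion) -> float:
    """Fraction of positive predictions that are true COVID-19 cases."""
    return c.tp / (c.tp + c.fp)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall (sensitivity) r."""
    return 2 * p * r / (p + r)


# Illustrative counts only; these are NOT results from the paper.
example = Confusion(tp=30, fp=10, tn=90, fn=20)
example_sens = sensitivity(example)          # 30 / 50
example_fpr = false_positive_rate(example)   # 10 / 100
example_prec = precision(example)            # 30 / 40

# Consistency check using the values reported in the abstract
# for COVID-Net on the Emory dataset.
reported_f1 = f1_score(0.49, 0.614)
print(f"F1 from reported precision and sensitivity: {reported_f1:.3f}")
```

Note that on an external dataset with no COVID-19 cases at all, sensitivity and precision are undefined or uninformative, which is why FPR is the natural metric for such data.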
