Paper title
Evaluation of importance estimators in deep learning classifiers for Computed Tomography
Paper authors
Paper abstract
Deep learning has shown superb performance in detecting objects and classifying images, and it holds great promise for analyzing medical imaging. Translating the success of deep learning to medical imaging, in which doctors need to understand the underlying process, requires the capability to interpret and explain the predictions of neural networks. Interpretability of deep neural networks often relies on estimating the importance of input features (e.g., pixels) with respect to the outcome (e.g., class probability). However, a number of importance estimators (also known as saliency maps) have been developed, and it is unclear which ones are more relevant for medical imaging applications. In the present work, we investigated the performance of several importance estimators in explaining the classification of computed tomography (CT) images by a convolutional deep network, using three distinct evaluation metrics. First, model-centric fidelity measures the decrease in model accuracy when certain inputs are perturbed. Second, concordance between importance scores and expert-defined segmentation masks is measured at the pixel level by receiver operating characteristic (ROC) curves. Third, we measure the region-wise overlap between an XRAI-based map and the segmentation mask by Dice Similarity Coefficients (DSC). Overall, two versions of SmoothGrad topped the fidelity and ROC rankings, whereas both Integrated Gradients and SmoothGrad excelled in the DSC evaluation. Interestingly, there was a critical discrepancy between model-centric (fidelity) and human-centric (ROC and DSC) evaluation. Expert expectation and intuition embedded in segmentation maps do not necessarily align with how the model arrived at its prediction. Understanding this difference in interpretability would help harness the power of deep learning in medicine.
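The three evaluation metrics named in the abstract can be illustrated with a minimal Python sketch on flattened toy images. This is not the authors' implementation. All function names (`pixel_auc`, `dice`, `top_k_mask`, `accuracy_drop`) are hypothetical, the top-k thresholding is only a simple stand-in for XRAI region selection, and overwriting pixels with a constant `fill` value is an assumed perturbation scheme that the abstract does not specify.

```python
from typing import Callable, Sequence, List


def pixel_auc(scores: Sequence[float], mask: Sequence[int]) -> float:
    """Pixel-level ROC AUC: the probability that a random mask pixel has a
    higher importance score than a random background pixel (ties count 1/2).
    Quadratic in the number of pixels, which is fine for a toy example."""
    pos = [s for s, m in zip(scores, mask) if m]
    neg = [s for s, m in zip(scores, mask) if not m]
    if not pos or not neg:
        raise ValueError("mask needs both foreground and background pixels")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dice(a: Sequence[int], b: Sequence[int]) -> float:
    """Dice Similarity Coefficient between two binary masks."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    return 2.0 * inter / total if total else 1.0


def top_k_mask(scores: Sequence[float], k: int) -> List[int]:
    """Binary mask of the k highest-scoring pixels.
    Assumption: a crude substitute for XRAI's region-based selection."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def accuracy_drop(predict: Callable[[Sequence[float]], int],
                  images: Sequence[Sequence[float]],
                  labels: Sequence[int],
                  maps: Sequence[Sequence[float]],
                  k: int,
                  fill: float = 0.0) -> float:
    """Fidelity: accuracy on the original images minus accuracy after the
    k most important pixels of each image are overwritten with `fill`."""
    def accuracy(batch):
        return sum(predict(x) == y for x, y in zip(batch, labels)) / len(labels)

    perturbed = []
    for img, importance in zip(images, maps):
        hit = top_k_mask(importance, k)
        perturbed.append([fill if h else v for v, h in zip(img, hit)])
    return accuracy(images) - accuracy(perturbed)


# Toy data: a 4-pixel "image", an importance map, and an expert mask.
scores = [0.9, 0.1, 0.8, 0.2]
expert_mask = [1, 1, 0, 0]
print(pixel_auc(scores, expert_mask))            # 0.5
print(dice(top_k_mask(scores, 2), expert_mask))  # 0.5

# Toy classifier (hypothetical): class 1 if total intensity exceeds 1.
toy_predict = lambda x: 1 if sum(x) > 1 else 0
images = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
labels = [1, 0]
maps = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.3, 0.4]]
print(accuracy_drop(toy_predict, images, labels, maps, k=2))  # 0.5
```

In this toy setup the importance map agrees only partly with the expert mask (AUC and Dice of 0.5), yet removing its top pixels still halves the accuracy. That is a small-scale picture of how model-centric and human-centric scores can diverge.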