Paper Title

Label fusion and training methods for reliable representation of inter-rater uncertainty

Authors

Lemay, Andreanne, Gros, Charley, Karthik, Enamundram Naga, Cohen-Adad, Julien

Abstract

Medical tasks are prone to inter-rater variability due to multiple factors such as image quality, professional experience and training, or guideline clarity. Training deep learning networks with annotations from multiple raters is a common practice that mitigates the model's bias towards a single expert. Reliable models generating calibrated outputs and reflecting the inter-rater disagreement are key to the integration of artificial intelligence in clinical practice. Various methods exist to take into account different expert labels. We focus on comparing three label fusion methods: STAPLE, averaging of the raters' segmentations, and random sampling of each rater's segmentation during training. Each label fusion method is studied using both the conventional training framework and the recently published SoftSeg framework, which limits information loss by treating the segmentation task as a regression. Our results, across 10 data splittings on two public datasets, indicate that SoftSeg models, regardless of the ground truth fusion method, had better calibration and preservation of the inter-rater variability compared with their conventional counterparts, without impacting the segmentation performance. Conventional models, i.e., trained with a Dice loss, with binary inputs, and sigmoid/softmax final activation, were overconfident and underestimated the uncertainty associated with inter-rater variability. Conversely, fusing labels by averaging with the SoftSeg framework led to underconfident outputs and overestimation of the rater disagreement. In terms of segmentation performance, the best label fusion method was different for the two datasets studied, indicating this parameter might be task-dependent. However, SoftSeg had segmentation performance systematically superior or equal to the conventionally trained models and had the best calibration and preservation of the inter-rater variability.
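The two simpler fusion strategies compared in the abstract, averaging the raters' binary masks into a soft ground truth and randomly sampling one rater's mask per training iteration, can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation; function names and the toy masks are hypothetical, and STAPLE is omitted because it requires an EM-based consensus estimation (available, e.g., as a filter in SimpleITK).

```python
import numpy as np

def average_fusion(masks):
    """Soft ground truth: voxel-wise mean of the raters' binary masks.

    Values land in [0, 1] and encode rater agreement, which suits a
    SoftSeg-style regression training target.
    """
    return np.mean(np.stack(masks, axis=0), axis=0)

def random_sampling_fusion(masks, rng=None):
    """Pick one rater's full mask at random (call once per iteration)."""
    rng = np.random.default_rng() if rng is None else rng
    return masks[rng.integers(len(masks))]

# Hypothetical toy example: three raters annotating a 2x2 image
raters = [np.array([[1, 0], [1, 0]]),
          np.array([[1, 1], [1, 0]]),
          np.array([[1, 0], [0, 0]])]

soft_gt = average_fusion(raters)          # voxel values in {0, 1/3, 2/3, 1}
sampled_gt = random_sampling_fusion(raters)  # exactly one rater's binary mask
```

With averaging, a conventional pipeline would typically binarize `soft_gt` (losing the agreement information), whereas SoftSeg keeps the fractional values as regression targets, which is the mechanism behind its better-calibrated outputs.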
