Paper Title
The International Workshop on Osteoarthritis Imaging Knee MRI Segmentation Challenge: A Multi-Institute Evaluation and Analysis Framework on a Standardized Dataset
Authors
Abstract
Purpose: To organize a knee MRI segmentation challenge for characterizing the semantic and clinical efficacy of automatic segmentation methods relevant to monitoring osteoarthritis progression. Methods: A dataset partition consisting of 3D knee MRI from 88 subjects at two timepoints with ground-truth articular (femoral, tibial, patellar) cartilage and meniscus segmentations was standardized. Challenge submissions and a majority-vote ensemble were evaluated using Dice score, average symmetric surface distance, volumetric overlap error, and coefficient of variation on a hold-out test set. Similarities in network segmentations were evaluated using pairwise Dice correlations. Articular cartilage thickness was computed per-scan and longitudinally. Correlation between thickness error and segmentation metrics was measured using Pearson's coefficient. Two empirical upper bounds for ensemble performance were computed using combinations of model outputs that consolidated true positives and true negatives. Results: Six teams (T1-T6) submitted entries for the challenge. No significant differences were observed across all segmentation metrics for all tissues (p=1.0) among the four top-performing networks (T2, T3, T4, T6). Dice correlations between network pairs were high (>0.85). Per-scan thickness errors were negligible among T1-T4 (p=0.99) and longitudinal changes showed minimal bias (<0.03 mm). Low correlations (<0.41) were observed between segmentation metrics and thickness error. The majority-vote ensemble was comparable to the top-performing networks (p=1.0). Empirical upper bound performances were similar for both combinations (p=1.0). Conclusion: Diverse networks learned to segment the knee similarly, and high segmentation accuracy did not correlate with cartilage thickness accuracy. Voting ensembles did not outperform individual networks but may help regularize individual models.
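To make the overlap metrics and the voting ensemble named in the abstract concrete, the following is a minimal sketch using their standard textbook definitions on binary masks. It is not the challenge's evaluation code: the function names, the handling of empty masks, and the strict-majority tie rule are assumptions made for illustration.

```python
import numpy as np


def dice_score(pred, truth):
    """Dice score: 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * np.logical_and(pred, truth).sum() / denom


def volumetric_overlap_error(pred, truth):
    """Volumetric overlap error: 1 - |A ∩ B| / |A ∪ B| (one minus Jaccard)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 0.0  # assumption: two empty masks have zero error
    return 1.0 - np.logical_and(pred, truth).sum() / union


def majority_vote(masks):
    """Voxel-wise ensemble: foreground where more than half of the masks agree.

    Assumption: ties (possible with an even number of models) go to background.
    """
    stacked = np.stack([np.asarray(m, dtype=bool) for m in masks])
    return stacked.sum(axis=0) * 2 > stacked.shape[0]


# Toy 1D "volumes" standing in for 3D segmentations.
truth = np.array([1, 1, 0, 0])
model_a = np.array([1, 1, 1, 0])
model_b = np.array([1, 0, 0, 0])
model_c = np.array([1, 1, 0, 1])

ensemble = majority_vote([model_a, model_b, model_c])
print(dice_score(model_a, truth))                # 0.8
print(volumetric_overlap_error(model_a, truth))  # about 0.333
print(dice_score(ensemble, truth))               # 1.0
```

In this toy case the three models make errors at different voxels, so the vote removes all of them. The abstract reports that the challenge networks segmented the knee similarly (pairwise Dice correlations >0.85), a setting in which voting has little to correct.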