Paper Title
Interpretable Models Capable of Handling Systematic Missingness in Imbalanced Classes and Heterogeneous Datasets
Authors
Abstract
The application of interpretable machine learning techniques to medical datasets facilitates early and fast diagnoses and provides deeper insight into the data. Furthermore, the transparency of these models increases trust among application domain experts. Medical datasets face common issues such as heterogeneous measurements, imbalanced classes with limited sample size, and missing data, which hinder the straightforward application of machine learning techniques. In this paper, we present a family of prototype-based (PB) interpretable models that are capable of handling these issues. The models introduced in this contribution show comparable or superior performance to alternative techniques applicable in such situations. Unlike ensemble-based models, which have to compromise on easy interpretation, the PB models here do not. Moreover, we propose a strategy for harnessing the power of ensembles while maintaining the intrinsic interpretability of the PB models, by averaging the model parameter manifolds. All models were evaluated on a synthetic (publicly available) dataset, in addition to detailed analyses of two real-world medical datasets (one publicly available). The results indicate that the models and strategies we introduce address the challenges of real-world medical data while remaining computationally inexpensive and transparent, and while performing similarly to or better than the alternatives.