Paper Title
Handling Missing Data in Decision Trees: A Probabilistic Approach
Paper Authors
Paper Abstract
Decision trees are a popular family of models thanks to attractive properties such as interpretability and the ability to handle heterogeneous data. At the same time, missing data is a prevalent occurrence that hinders the performance of machine learning models. As such, handling missing data in decision trees is a well-studied problem. In this paper, we tackle this problem by taking a probabilistic approach. At deployment time, we use tractable density estimators to compute the "expected prediction" of our models. At learning time, we fine-tune the parameters of already learned trees by minimizing their "expected prediction loss" with respect to our density estimators. We provide brief experiments showcasing the effectiveness of our methods compared to a few baselines.
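To make the idea of an "expected prediction" concrete, the following is a minimal sketch rather than the paper's implementation. It assumes binary features and a fully factorized Bernoulli density, chosen here only because it is trivially tractable; the abstract does not say which density estimators are used. The names `Leaf`, `Node`, `expected_prediction`, and `p_one` are hypothetical. Observed features follow the usual branch, while a missing feature averages both branches, weighted by its probability.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Leaf:
    value: float  # prediction stored at this leaf


@dataclass
class Node:
    feature: int   # index of the binary feature tested at this node
    left: "Tree"   # subtree taken when the feature equals 0
    right: "Tree"  # subtree taken when the feature equals 1


Tree = Union[Leaf, Node]


def expected_prediction(tree: Tree,
                        x: List[Optional[int]],
                        p_one: List[float]) -> float:
    """Expected tree output when some entries of x are missing (None).

    ASSUMPTION: features are independent Bernoulli variables with
    p_one[i] = P(X_i = 1). This stands in for a generic tractable
    density estimator and is not taken from the paper.
    """
    if isinstance(tree, Leaf):
        return tree.value
    f = tree.feature
    v = x[f]
    if v is not None:
        # Observed feature: follow the matching branch as usual.
        branch = tree.right if v == 1 else tree.left
        return expected_prediction(branch, x, p_one)
    # Missing feature: condition on each possible value, then average.
    # Fixing the value in a copy keeps the result exact even if the
    # same feature is tested again further down the path.
    x0 = x[:f] + [0] + x[f + 1:]
    x1 = x[:f] + [1] + x[f + 1:]
    p = p_one[f]
    return ((1.0 - p) * expected_prediction(tree.left, x0, p_one)
            + p * expected_prediction(tree.right, x1, p_one))


# Toy tree: test feature 0 first, then feature 1 on the right branch.
toy_tree = Node(0, Leaf(0.0), Node(1, Leaf(1.0), Leaf(3.0)))
toy_p_one = [0.5, 0.25]

fully_observed = expected_prediction(toy_tree, [1, 1], toy_p_one)
one_missing = expected_prediction(toy_tree, [1, None], toy_p_one)
all_missing = expected_prediction(toy_tree, [None, None], toy_p_one)
```

With every feature observed, the function reduces to the ordinary tree prediction. Under a richer density estimator, the branch weights would be conditional probabilities given the observed features instead of fixed marginals.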