Paper title

Multiple imputation using chained random forests: a preliminary study based on the empirical distribution of out-of-bag prediction errors

Paper authors

Hong, Shangzhi; Sun, Yuqi; Li, Hanying; Lynn, Henry S.

Paper abstract

Missing data are common in biomedical data analyses, and imputation methods based on random forests (RF) have become widely accepted because the RF algorithm can achieve high accuracy without requiring the data distributions or relationships to be specified. However, RF predictions carry no information about prediction uncertainty, which is unacceptable for multiple imputation. Existing RF-based multiple imputation methods attempt proper multiple imputation either by sampling directly from the observations under the predicting nodes, without accounting for prediction error, or by assuming that the prediction errors are normally distributed. In this study, a novel RF-based multiple imputation method is proposed that constructs conditional distributions from the empirical distribution of out-of-bag prediction errors. The proposed method is compared with a previous method that makes parametric assumptions about RF prediction errors, and with predictive mean matching, in simulation studies on data containing an interaction term. The proposed non-parametric method can deliver valid multiple imputation results. The accompanying R package for this study is publicly available.
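
The core idea in the abstract, imputing with a prediction plus a draw from the empirical distribution of out-of-bag (OOB) prediction errors, can be illustrated with a minimal Python sketch. It is not the accompanying R package or the authors' implementation. It makes several simplifying assumptions: a bagged ensemble of one-split regression stumps stands in for a random forest, there is a single predictor and a single incomplete variable (so no chaining), and every function name is hypothetical.

```python
import random
import statistics


def fit_stump(xs, ys):
    """Fit a one-split regression stump (threshold, left mean, right mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # drop the max so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all x identical: fall back to a constant predictor
        m = statistics.fmean(ys)
        return (xs[0], m, m)
    return best[1:]


def predict_stump(stump, x):
    t, ml, mr = stump
    return ml if x <= t else mr


def fit_bagged(xs, ys, n_trees, rng):
    """Fit bootstrap-aggregated stumps and collect OOB prediction errors."""
    n = len(xs)
    stumps, oob_sum, oob_cnt = [], [0.0] * n, [0] * n
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        stumps.append(stump)
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:  # only trees that never saw row i predict it
                oob_sum[i] += predict_stump(stump, xs[i])
                oob_cnt[i] += 1
    # OOB error = observed value minus the mean OOB prediction
    oob_errors = [ys[i] - oob_sum[i] / oob_cnt[i] for i in range(n) if oob_cnt[i] > 0]
    return stumps, oob_errors


def impute(stumps, oob_errors, x_missing, m, rng):
    """Return m imputed value lists: ensemble prediction + a sampled OOB error."""
    preds = [statistics.fmean(predict_stump(s, x) for s in stumps) for x in x_missing]
    return [[p + rng.choice(oob_errors) for p in preds] for _ in range(m)]


# Toy data: y depends on x; two rows have x observed but y missing.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(80)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]
stumps, oob_errors = fit_bagged(xs, ys, n_trees=50, rng=rng)
imputations = impute(stumps, oob_errors, x_missing=[2.5, 7.5], m=5, rng=rng)
```

Each of the five lists in `imputations` is one completed version of the two missing values. Because the added noise is resampled from the observed OOB errors, no distributional form such as normality is imposed on the prediction error.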
