Influence of parallel computing strategies of iterative imputation of missing data: a case study on missForest

Paper title

Influence of parallel computing strategies of iterative imputation of missing data: a case study on missForest

Authors

Shangzhi Hong, Yuqi Sun, Hanying Li, Henry S. Lynn

Abstract

Iterative imputation methods based on machine learning are widely accepted by researchers for imputing missing data, but they can be time-consuming on large datasets. Parallel computing strategies have been proposed to overcome this drawback, but their impact on imputation results and on subsequent statistical analyses is relatively unknown. This study examines the two parallel strategies (variable-wise distributed computation and model-wise distributed computation) implemented in the random-forest imputation method missForest. Results from the simulation experiments showed that the two parallel strategies can influence both the imputation process and the final imputation results differently. Specifically, even though both strategies produced similar normalized root mean squared prediction errors, the variable-wise distributed strategy led to additional biases when estimating the means and inter-correlations of the covariates and their regression coefficients.
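The abstract does not spell out how the two strategies differ mechanically, so the following is only a toy sketch under stated assumptions. It assumes that under model-wise distribution the variables are still imputed one after another (only the fitting of each model is split across workers), so each variable sees the values just imputed for earlier variables, whereas under variable-wise distribution all variables in an iteration are imputed concurrently from the same snapshot of the data. A least-squares line stands in for the random forest, no real parallelism is used, and all function names (`fit_line`, `impute_column`, `sequential_pass`, `snapshot_pass`) are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return my - slope * mx, slope


def impute_column(source, j, missing):
    """Predict the missing cells of column j from the other columns of `source`.

    Toy stand-in for a random forest (an assumption of this sketch): a line
    fitted on the row mean of the other columns, using rows where j is observed.
    """
    n_rows = len(source[j])
    pred = [mean(source[k][i] for k in range(len(source)) if k != j)
            for i in range(n_rows)]
    obs = [i for i in range(n_rows) if i not in missing[j]]
    intercept, slope = fit_line([pred[i] for i in obs],
                                [source[j][i] for i in obs])
    return {i: intercept + slope * pred[i] for i in missing[j]}


def initial_fill(data, missing):
    """Replace each missing cell with the mean of the observed cells in its column."""
    filled = []
    for j, col in enumerate(data):
        m = mean(v for i, v in enumerate(col) if i not in missing[j])
        filled.append([m if i in missing[j] else v for i, v in enumerate(col)])
    return filled


def sequential_pass(current, missing):
    """One iteration where each column sees the updates made to earlier columns."""
    work = [col[:] for col in current]
    for j in range(len(work)):
        for i, v in impute_column(work, j, missing).items():
            work[j][i] = v
    return work


def snapshot_pass(current, missing):
    """One iteration where every column is imputed from the same starting snapshot."""
    updated = [col[:] for col in current]
    for j in range(len(current)):  # each j could be handed to a separate worker
        for i, v in impute_column(current, j, missing).items():
            updated[j][i] = v
    return updated


# Made-up illustrative data: two columns, None marks a missing cell.
data = [
    [1.0, 2.0, None, 4.0, 5.0, 6.0],
    [2.1, None, 6.2, 7.9, None, 12.1],
]
missing = [{2}, {1, 4}]

filled = initial_fill(data, missing)
seq = sequential_pass(filled, missing)
snap = snapshot_pass(filled, missing)
print("sequential:", seq[1])
print("snapshot:  ", snap[1])
```

In this toy setup the first column is imputed identically by both passes, while the second column differs because only the sequential pass lets it use the freshly imputed value of the first. This only illustrates how update order can change imputed values; it says nothing about the size or direction of the biases reported in the study.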
