Paper Title

Logistic regression models for patient-level prediction based on massive observational data: Do we need all data?

Authors

John, Luis H., Kors, Jan A., Reps, Jenna M., Ryan, Patrick B., Rijnbeek, Peter R.

Abstract


Objective: Provide guidance on sample size considerations for developing predictive models by empirically establishing the adequate sample size, which balances the competing objectives of improving model performance and reducing model complexity as well as computational requirements.

Materials and Methods: We empirically assess the effect of sample size on prediction performance and model complexity by generating learning curves for 81 prediction problems (23 outcomes predicted in a depression cohort, 58 outcomes predicted in a hypertension cohort) in three large observational health databases, requiring training of 17,248 prediction models. The adequate sample size was defined as the sample size for which the performance of a model equalled the maximum model performance minus a small threshold value.

Results: The adequate sample size achieves a median reduction of the number of observations of 9.5%, 37.3%, 58.5%, and 78.5% for the thresholds of 0.001, 0.005, 0.01, and 0.02, respectively. The median reduction of the number of predictors in the models was 8.6%, 32.2%, 48.2%, and 68.3% for the thresholds of 0.001, 0.005, 0.01, and 0.02, respectively.

Discussion: Based on our results, a conservative, yet significant, reduction in sample size and model complexity can be estimated for future prediction work. However, if a researcher is willing to generate a learning curve, a much larger reduction of the model complexity may be possible, as suggested by a large outcome-dependent variability.

Conclusion: Our results suggest that in most cases only a fraction of the available data was sufficient to produce a model close to the performance of one developed on the full data set, but with a substantially reduced model complexity.
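The paper's definition of "adequate sample size" can be sketched directly from the abstract: scan a learning curve of (sample size, performance) points and take the smallest sample size whose performance is within a small threshold of the maximum observed performance. The sketch below is a minimal illustration of that definition; the function name, the curve data, and the use of AUC values are illustrative assumptions, not the authors' implementation.

```python
def adequate_sample_size(learning_curve, threshold):
    """Smallest sample size whose performance is within `threshold` of the
    maximum performance on the learning curve.

    learning_curve: list of (sample_size, performance) pairs,
    assumed sorted by increasing sample_size.
    """
    max_perf = max(perf for _, perf in learning_curve)
    for n, perf in learning_curve:
        if perf >= max_perf - threshold:
            return n
    # Fallback: the largest sample size always satisfies the definition.
    return learning_curve[-1][0]

# Illustrative (made-up) AUC learning curve, not data from the paper:
curve = [(1000, 0.70), (5000, 0.74), (20000, 0.755), (80000, 0.76)]
print(adequate_sample_size(curve, 0.01))  # smallest n within 0.01 of max AUC
print(adequate_sample_size(curve, 0.02))  # a looser threshold admits a smaller n
```

As in the paper's results, a looser threshold (e.g. 0.02 vs. 0.001) selects a smaller adequate sample size, trading a small loss in performance for a large reduction in data and model complexity.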
