Paper Title

CEDAR: Communication Efficient Distributed Analysis for Regressions

Authors

Changgee Chang, Zhiqi Bu, Qi Long

Abstract

Electronic health records (EHRs) offer great promise for advancing precision medicine and, at the same time, present significant analytical challenges. In particular, patient-level data in EHRs often cannot be shared across institutions (data sources) due to government regulations and/or institutional policies. As a result, there is growing interest in distributed learning over multiple EHR databases without sharing patient-level data. To tackle such challenges, we propose a novel communication-efficient method that aggregates the local optimal estimates by turning the problem into a missing data problem. In addition, we propose incorporating posterior samples from remote sites, which can provide partial information on the missing quantities and improve the efficiency of parameter estimates while having the differential privacy property, thus reducing the risk of information leakage. The proposed approach, without sharing the raw patient-level data, allows for proper statistical inference and can accommodate sparse regressions. We provide a theoretical investigation of the asymptotic properties of the proposed method for statistical inference as well as differential privacy, and evaluate its performance in simulations and real data analyses in comparison with several recently developed methods.
