Paper Title

Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data

Authors

Jianmin Chen, Robert H. Aseltine, Fei Wang, Kun Chen

Abstract

Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially when modeling disease onset from prior medical diagnoses and procedures. Dealing with the resulting highly sparse, large-scale binary feature matrix is notoriously challenging: conventional methods may suffer from a lack of power in testing and inconsistency in model fitting, while machine learning methods may be unable to produce interpretable results or clinically meaningful risk factors. To improve EHR-based modeling and utilize the natural hierarchical structure of disease classification, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features, in which dimension reduction is achieved not only through sparsity pursuit but also through an aggregation promoter based on the logic operator "or". We convert the combinatorial problem into a convex, linearly constrained regularized estimation problem, which enables scalable computation with theoretical guarantees. In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses as guided by the diagnosis hierarchy of the International Classification of Diseases. By balancing the rarity and specificity of the EHR diagnosis records, our strategy improves both prediction and model interpretation. We identify important higher-level categories and subcategories of mental health conditions and simultaneously determine the level of specificity needed for each of them in predicting suicide risk.
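To make the "or" aggregation idea concrete, the sketch below collapses rare child-level diagnosis indicators into a parent-level indicator with an elementwise logical OR. It is only a toy illustration of that single operation, not the paper's estimator: the patient data, the diagnosis codes, the `parent_of` mapping, and the helper names `or_aggregate` and `prevalence` are all assumptions made up for this example, and the choice of which nodes to aggregate (which the paper makes through regularized estimation) is fixed by hand here.

```python
# Toy data (hypothetical): each list holds one binary diagnosis flag per patient.
# The codes and the child -> parent mapping are illustrative assumptions only.
features = {
    "F32.0": [1, 0, 0, 0, 0, 0],
    "F32.1": [0, 0, 1, 0, 0, 0],
    "F33.0": [0, 0, 0, 0, 1, 0],
}
parent_of = {"F32.0": "F32", "F32.1": "F32", "F33.0": "F33"}


def or_aggregate(features, parent_of):
    """Merge child indicators into their parent via elementwise logical OR."""
    aggregated = {}
    for code, column in features.items():
        parent = parent_of[code]
        # Start each parent as all zeros, then OR in every child column.
        current = aggregated.get(parent, [0] * len(column))
        aggregated[parent] = [a | b for a, b in zip(current, column)]
    return aggregated


def prevalence(column):
    """Fraction of patients for whom the binary feature equals 1."""
    return sum(column) / len(column)


aggregated = or_aggregate(features, parent_of)
for code, column in aggregated.items():
    print(code, column, round(prevalence(column), 3))
```

The aggregated parent feature is never rarer than any of its children, which is the rarity side of the rarity-versus-specificity trade-off the abstract describes; the cost is that the parent no longer distinguishes between the child diagnoses.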
