Paper title

rfPhen2Gen: A machine learning based association study of brain imaging phenotypes to genotypes

Authors

Muhammad Ammar Malik, Alexander S. Lundervold, Tom Michoel

Abstract

Imaging genetic studies aim to find associations between genetic variants and imaging quantitative traits (QTs). Traditional genome-wide association studies (GWAS) are based on univariate statistical tests, but when multiple traits are analyzed together they suffer from a multiple-testing problem and do not take into account correlations among the traits. An alternative approach to multi-trait GWAS is to reverse the functional relation between genotypes and traits, by fitting a multivariate regression model to predict genotypes from multiple traits simultaneously. However, current reverse genotype prediction approaches are mostly based on linear models. Here, we evaluated random forest regression (RFR) as a method to predict SNPs from imaging QTs and identify biologically relevant associations. We trained machine learning models to predict 518,484 SNPs using 56 brain imaging QTs. We observed that genotype regression error is a better indicator of permutation p-value significance than genotype classification accuracy. SNPs within the known Alzheimer's disease (AD) risk gene APOE had the lowest RMSE for lasso and random forest, but not for ridge regression. Moreover, random forests identified additional SNPs that were not prioritized by the linear models but are known to be associated with brain-related disorders. Feature selection identified well-known brain regions associated with AD, like the hippocampus and amygdala, as important predictors of the most significant SNPs. In summary, our results indicate that non-linear methods like random forests may offer additional insights into phenotype-genotype associations compared to traditional linear multivariate GWAS methods.
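The reverse-regression idea in the abstract (predict one SNP from all imaging QTs, score it by RMSE, and assess significance by permutation) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the data are synthetic, genotypes are coded as 0/1/2 allele counts, scikit-learn is used as the random forest implementation, and the helper names `cv_rmse` and `permutation_p_value`, the number of trees, the 5-fold cross-validation, and the number of permutations are all hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict


def cv_rmse(traits, genotype, seed=0):
    """Cross-validated RMSE of a random forest predicting one SNP from the traits."""
    model = RandomForestRegressor(n_estimators=30, random_state=seed)
    predicted = cross_val_predict(model, traits, genotype, cv=5)
    return float(np.sqrt(np.mean((genotype - predicted) ** 2)))


def permutation_p_value(traits, genotype, n_perm=10, seed=0):
    """Compare the observed RMSE with RMSEs obtained after shuffling the genotype.

    Shuffling breaks any trait-genotype association, so the shuffled RMSEs
    form an empirical null distribution. A lower RMSE is better, so the
    p-value counts null RMSEs that are at least as low as the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = cv_rmse(traits, genotype, seed)
    null = [cv_rmse(traits, rng.permutation(genotype), seed) for _ in range(n_perm)]
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (1 + sum(r <= observed for r in null)) / (1 + n_perm)
    return observed, p_value


# Synthetic toy data (hypothetical): 200 subjects, 56 traits, one SNP.
rng = np.random.default_rng(42)
n_subjects, n_traits = 200, 56
genotype = rng.binomial(2, 0.3, size=n_subjects).astype(float)  # 0/1/2 allele counts
traits = rng.normal(size=(n_subjects, n_traits))
traits[:, 0] += 2.0 * genotype  # make the first trait depend on the genotype

rmse, p_value = permutation_p_value(traits, genotype)
print(f"RMSE = {rmse:.3f}, permutation p-value = {p_value:.3f}")
```

In a real analysis this procedure would be repeated for every SNP, and far more permutations would be needed to resolve small p-values; the toy settings here are kept small only so the sketch runs quickly.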
