Modelling Technical and Biological Effects in scRNA-seq data with Scalable GPLVMs

Paper title

Modelling Technical and Biological Effects in scRNA-seq data with Scalable GPLVMs

Authors

Lalchand, Vidhi; Ravuri, Aditya; Dann, Emma; Kumasaka, Natsuhiko; Sumanaweera, Dinithi; Lindeboom, Rik G. H.; Madad, Shaista; Teichmann, Sarah A.; Lawrence, Neil D.

Abstract

Single-cell RNA-seq datasets are growing in size and complexity, enabling the study of cellular composition changes in various biological/clinical contexts. Scalable dimensionality reduction techniques are needed to disentangle the biological variation in these datasets while accounting for technical and biological confounders. In this work, we extend a popular approach for probabilistic non-linear dimensionality reduction, the Gaussian process latent variable model (GPLVM), to scale to massive single-cell datasets while explicitly accounting for technical and biological confounders. The key idea is to use an augmented kernel that preserves the factorisability of the lower bound, allowing for fast stochastic variational inference. We demonstrate its ability to reconstruct the latent signatures of innate immunity recovered in Kumasaka et al. (2021) with 9x lower training time. We further analyze a COVID dataset and demonstrate, across a cohort of 130 individuals, that this framework enables data integration while capturing interpretable signatures of infection. Specifically, we explore COVID severity as a latent dimension to refine patient stratification and capture disease-specific gene expression.
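
The abstract does not spell out the form of the augmented kernel. As a rough illustration only, the sketch below assumes one simple construction: a squared-exponential (RBF) kernel evaluated on the concatenation of a cell's latent coordinates and its known covariates (for example a one-hot batch label). The function names, the shared lengthscale, and the toy inputs are all hypothetical and are not taken from the paper. For this particular kernel, augmenting the input is the same as multiplying a kernel on the latents by a kernel on the covariates, which the last lines check numerically.

```python
import math


def rbf(x, y, lengthscale=1.0):
    """Squared-exponential kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)


def augmented_kernel(z1, c1, z2, c2, lengthscale=1.0):
    """RBF kernel on the concatenated input [latent z, known covariate c].

    Assumed form for illustration; the paper's kernel may differ.
    """
    return rbf(list(z1) + list(c1), list(z2) + list(c2), lengthscale)


# Hypothetical toy inputs: 2-D latent coordinates and a one-hot batch label.
z_a, c_a = [0.3, -1.2], [1.0, 0.0]
z_b, c_b = [0.1, 0.4], [0.0, 1.0]

k_aug = augmented_kernel(z_a, c_a, z_b, c_b)

# With a shared lengthscale, the augmented RBF kernel equals the product of
# a latent-only kernel and a covariate-only kernel.
k_product = rbf(z_a, z_b) * rbf(c_a, c_b)
print(abs(k_aug - k_product) < 1e-12)
```

In this toy form, two cells from different batches get a smaller kernel value than the same two cells would if they shared a batch, so part of their dissimilarity is attributed to the known covariate and not to the latent coordinates.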
