Paper Title

Clustering microbiome data using mixtures of logistic normal multinomial models

Authors

Yuan Fang, Sanjeena Subedi

Abstract

Discrete data, such as counts of microbiome taxa resulting from next-generation sequencing, are routinely encountered in bioinformatics. Taxa count data in microbiome studies are typically high-dimensional and over-dispersed, and they reveal only relative abundance, so they are treated as compositional. Analyzing compositional data presents many challenges because the data are restricted to a simplex. In a logistic normal multinomial model, the relative abundance is mapped from the simplex to a latent variable in real Euclidean space using the additive log-ratio transformation. While the logistic normal multinomial approach brings flexibility in modeling the data, it comes with a heavy computational cost, as parameter estimation typically relies on Bayesian techniques. In this paper, we develop a novel mixture of logistic normal multinomial models for clustering microbiome data. Additionally, we utilize an efficient framework for parameter estimation using variational Gaussian approximations (VGA). Adopting a variational Gaussian approximation for the posterior of the latent variable reduces the computational overhead substantially. The proposed method is illustrated on simulated and real datasets.
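As a rough illustration of the additive log-ratio mapping mentioned in the abstract, the following minimal sketch maps a composition to real space and back, then draws one count vector from a single logistic normal multinomial component. It is not the authors' implementation: the function names (`alr`, `alr_inv`, `sample_lnm`), the choice of the last taxon as the reference, and all parameter values are assumptions made for this example, and neither the mixture nor the variational estimation is shown.

```python
import numpy as np

def alr(p):
    """Additive log-ratio: map a (K+1)-part composition to R^K.
    Assumption: the last part is used as the reference."""
    p = np.asarray(p, dtype=float)
    return np.log(p[..., :-1] / p[..., -1:])

def alr_inv(y):
    """Inverse ALR: map a point in R^K back to the (K+1)-part simplex."""
    y = np.asarray(y, dtype=float)
    # Append a zero for the reference part, then apply a softmax.
    z = np.concatenate([y, np.zeros(y.shape[:-1] + (1,))], axis=-1)
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_lnm(mu, sigma, total_count, rng):
    """Draw one count vector from a single logistic normal multinomial
    component: latent y ~ N(mu, sigma), counts ~ Multinomial(n, alr_inv(y))."""
    y = rng.multivariate_normal(mu, sigma)
    return rng.multinomial(total_count, alr_inv(y))

rng = np.random.default_rng(0)
p = np.array([0.2, 0.3, 0.5])   # illustrative relative abundances
y = alr(p)                      # latent representation in R^2
p_back = alr_inv(y)             # recovers p up to floating-point error
counts = sample_lnm(np.zeros(2), np.eye(2), 1000, rng)
```

The transform removes the sum-to-one constraint, which is why a Gaussian can be placed on the latent variable; the sampled counts always sum to the chosen sequencing depth (1000 here).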
