Topical Hidden Genome: Discovering Latent Cancer Mutational Topics using a Bayesian Multilevel Context-learning Approach

Paper Title

Topical Hidden Genome: Discovering Latent Cancer Mutational Topics using a Bayesian Multilevel Context-learning Approach

Authors

Saptarshi Chakraborty, Zoe Guan, Colin B. Begg, Ronglai Shen

Abstract

Statistical inference on the cancer-site specificities of collective ultra-rare whole-genome somatic mutations is an open problem. Traditional statistical methods cannot handle whole-genome mutation data because of their ultra-high dimensionality and extreme sparsity -- e.g., >30 million unique variants are observed in the ~1700 whole-genome tumor dataset considered herein, of which >99% are encountered only once. To harness the information in these rare variants, we recently proposed the "hidden genome model", a formal multilevel multi-logistic model that mines information in ultra-rare somatic variants to characterize tumor types. The model condenses signals in rare variants through a hierarchical layer that leverages the contexts of individual mutations. The model is currently implemented using consistent, scalable point estimation techniques that can handle tens of millions of variants detected across thousands of tumors. Our recent publications have demonstrated its impressive accuracy and attributability at scale. However, principled statistical inference from the model is infeasible due to the volume, correlation, and non-interpretability of the mutation contexts. In this paper we propose a novel framework that leverages topic models from the field of computational linguistics to induce an *interpretable dimension reduction* of the mutation contexts used in the model. The proposed model is implemented using an efficient MCMC algorithm that permits rigorous full Bayesian inference at a scale that is orders of magnitude beyond the capability of out-of-the-box high-dimensional multi-class regression methods and software. We apply our model to the Pan Cancer Analysis of Whole Genomes (PCAWG) dataset, and our results reveal interesting novel insights.
