Paper title
Redundancy-aware unsupervised ranking based on game theory -- application to gene enrichment analysis
Paper authors
Paper abstract
Gene set collections are a common ground for studying the enrichment of genes for specific phenotypic traits. Gene set enrichment analysis aims to identify genes that are over-represented in gene set collections and might be associated with a specific phenotypic trait. However, as this involves a massive number of hypothesis tests, it is often questionable whether a pre-processing step to reduce the size of gene set collections is helpful. Moreover, the often highly overlapping gene sets and the consequent low interpretability of gene set collections call for a reduction of the included gene sets. Inspired by this bioinformatics context, we propose a method to rank sets within a family of sets based on the distribution of the singletons and their size. We obtain the sets' importance scores by computing Shapley values without incurring the usual exponential number of evaluations of the value function. Moreover, we address the challenge of including redundancy awareness in the rankings obtained where, in our case, sets are redundant if they show prominent intersections. We finally evaluate our approach on gene set collections; the rankings obtained show low redundancy and high coverage of the genes. The unsupervised nature of the proposed ranking does not allow for an evident increase in the number of significant gene sets for specific phenotypic traits when reducing the size of the collections. However, we believe that the proposed rankings are of use in bioinformatics to increase the interpretability of gene set collections and are a step forward in including redundancy in Shapley value computations.
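As an illustration of how Shapley values for sets can be obtained without enumerating all coalitions, the following minimal sketch uses an assumed value function that is not taken from the abstract: the coverage game, where the value of a group of sets is the number of distinct genes in their union. For this particular game each gene's unit of value is shared equally among the sets that contain it, so the Shapley value has a closed form. The names `coverage_shapley`, `brute_force_shapley`, and the toy collection are hypothetical, and the value function actually used in the paper may differ.

```python
from fractions import Fraction
from itertools import permutations


def coverage_shapley(sets):
    """Closed-form Shapley values for the ASSUMED coverage game
    v(T) = |union of the sets in T| (not necessarily the paper's game)."""
    # Count how many sets contain each gene.
    counts = {}
    for members in sets.values():
        for gene in members:
            counts[gene] = counts.get(gene, 0) + 1
    # Each gene contributes 1, split equally among the sets containing it.
    return {
        name: sum(Fraction(1, counts[gene]) for gene in members)
        for name, members in sets.items()
    }


def brute_force_shapley(sets):
    """Reference computation: average marginal coverage over all orderings.
    Factorial cost, only usable for tiny collections."""
    names = list(sets)
    totals = {name: Fraction(0) for name in names}
    orders = list(permutations(names))
    for order in orders:
        covered = set()
        for name in order:
            totals[name] += len(sets[name] - covered)  # newly covered genes
            covered |= sets[name]
    return {name: totals[name] / len(orders) for name in names}


# Hypothetical toy collection: A and B overlap heavily, C is disjoint.
toy = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g3"},
    "C": {"g4"},
}

scores = coverage_shapley(toy)
assert scores == brute_force_shapley(toy)

# Rank sets by decreasing score (ties keep insertion order).
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, {name: str(value) for name, value in scores.items()})
```

In this toy example the set B, which is fully contained in A, receives a lower score than its size alone would suggest because it shares all of its genes. This shows how overlap can enter a Shapley-based score, but it is not the redundancy-aware ranking described in the abstract.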