论文标题
数据结构和算法,用于层次聚类中精确推断
Data Structures & Algorithms for Exact Inference in Hierarchical Clustering
论文作者
论文摘要
分层聚类是一项基本任务,通常用于发现数据中有意义的结构,例如系统发育树,概念的分类法,癌症的亚型以及粒子物理学粒子衰变的级联反应。通常,由于可能的分层聚类的组合数,近似算法用于推断。与现有方法相反,我们提出了基于新型的格子数据结构的\ emph {Exact}推断的新型动态编程算法,并且我们证明我们可以精确地计算分区功能,最大的约不能等级结构,以及下层和层次结构和层次结构的边际概率。我们的算法在时间和空间上与$ n $元素的powerset成正比的时间和空间比例比例,比明确考虑(2n-3)中的每个(2n-3)更高效!可能的层次结构。同样,对于我们确切的算法变得不可行的较大数据集,我们基于稀疏格子的算法引入了近似算法,该算法与其他基准测试良好。确切的方法与粒子物理学的数据分析有关,以及在癌症基因组中找到基因表达之间的相关性,我们在这两个领域中给出了示例,我们的算法表现优于贪婪和束探索基线。此外,我们考虑了Dasgupta与合成数据的成本。
Hierarchical clustering is a fundamental task often used to discover meaningful structures in data, such as phylogenetic trees, taxonomies of concepts, subtypes of cancer, and cascades of particle decays in particle physics. Typically approximate algorithms are used for inference due to the combinatorial number of possible hierarchical clusterings. In contrast to existing methods, we present novel dynamic-programming algorithms for \emph{exact} inference in hierarchical clustering based on a novel trellis data structure, and we prove that we can exactly compute the partition function, maximum likelihood hierarchy, and marginal probabilities of sub-hierarchies and clusters. Our algorithms scale in time and space proportional to the powerset of $N$ elements which is super-exponentially more efficient than explicitly considering each of the (2N-3)!! possible hierarchies. Also, for larger datasets where our exact algorithms become infeasible, we introduce an approximate algorithm based on a sparse trellis that compares well to other benchmarks. Exact methods are relevant to data analyses in particle physics and for finding correlations among gene expression in cancer genomics, and we give examples in both areas, where our algorithms outperform greedy and beam search baselines. In addition, we consider Dasgupta's cost with synthetic data.