论文标题
Genie:一种新的,快速且耐离群的层次聚类算法
Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm
论文作者
论文摘要
应用分层聚类算法所需的时间最常由成对差异度量的计算数量主导。对于较大的数据集,这种约束使使用所有经典链接标准的使用不利,但单个链接一个链接标准。但是,众所周知,单个链接聚类算法对离群值非常敏感,产生高度偏斜的树状图,因此通常不会反映出真正的潜在数据结构 - 除非群集分离良好。为了克服其局限性,我们提出了一个名为Genie的新的分层群集链接标准。也就是说,我们的算法将两个簇链接到两个群集的方式,即群集尺寸的经济不平等措施(例如,Gini-或Bonferroni-Index)并不会大大增加超过给定阈值。提出的基准表明引入的方法具有很高的实际实用性:它通常优于病房或平均链接,而群集质量则保持了单个连锁的速度。 Genie算法很容易平行,因此可以在多个线程上运行以进一步加快其执行。它的内存开销很小:无需预先计算完整的距离矩阵即可执行计算以获得所需的群集。它可以应用于配备有差异度度量的任意空间,例如,对算法的实际矢量,DNA或蛋白质序列,图像,排名,信息数据等。
The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure -- unless the clusters are well-separated. To overcome its limitations, we propose a new hierarchical clustering linkage criterion called Genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index) of the cluster sizes does not drastically increase above a given threshold. The presented benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage's speed. The Genie algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution even further. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. It can be applied on arbitrary spaces equipped with a dissimilarity measure, e.g., on real vectors, DNA or protein sequences, images, rankings, informetric data, etc. A reference implementation of the algorithm has been included in the open source 'genie' package for R. See also https://genieclust.gagolewski.com for a new implementation (genieclust) -- available for both R and Python.