Revisiting Agglomerative Clustering

Title

Revisiting Agglomerative Clustering

Authors

Tokuda, Eric K., Comin, Cesar H., Costa, Luciano da F.

Abstract

An important issue in clustering concerns the avoidance of false positives while searching for clusters. This work addressed this problem considering agglomerative methods, namely the single, average, median, complete, centroid and Ward's approaches, applied to unimodal and bimodal datasets obeying uniform, Gaussian, exponential and power-law distributions. A model of clusters was also adopted, involving a higher-density nucleus surrounded by a transition region, followed by outliers. This paved the way to defining an objective means for identifying the clusters from dendrograms. The adopted model also allowed the relevance of the clusters to be quantified in terms of the height of their subtrees. The obtained results include the verification that many methods detect two clusters in unimodal data. The single-linkage method was found to be more resilient to false positives. Also, several methods detected clusters not corresponding directly to the nucleus. The possibility of identifying the type of distribution was also investigated.
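The setup described in the abstract can be sketched roughly as follows: draw a unimodal sample, build a dendrogram with each of the six linkage methods, and inspect the two subtrees below the root. This is only an illustrative sketch using SciPy's `linkage`; the helper `top_level_summary`, the sample size, the 2-D Gaussian data and the choice of reporting merge heights and subtree sizes are assumptions made here, not the authors' actual identification or relevance criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The six agglomerative methods named in the abstract, by their SciPy names.
METHODS = ["single", "average", "median", "complete", "centroid", "ward"]


def top_level_summary(points, method):
    """Hypothetical helper: describe the two subtrees joined at the root.

    Returns the height of the final merge plus the merge height and size
    of each of the two child subtrees. A leaf child has height 0 and size 1.
    """
    n = points.shape[0]
    # Each row of Z is [child_a, child_b, merge_height, subtree_size].
    Z = linkage(points, method=method)
    root = Z[-1]

    def node_info(idx):
        idx = int(idx)
        if idx < n:  # indices below n refer to original observations
            return 0.0, 1
        row = Z[idx - n]  # indices >= n refer to earlier merges
        return float(row[2]), int(row[3])

    (h_a, size_a), (h_b, size_b) = node_info(root[0]), node_info(root[1])
    return {
        "root_height": float(root[2]),
        "child_heights": (h_a, h_b),
        "child_sizes": (size_a, size_b),
    }


# Unimodal test data: a single 2-D Gaussian blob (assumed for illustration).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))

summaries = {m: top_level_summary(data, m) for m in METHODS}
for name, summary in summaries.items():
    print(name, summary)
```

One possible reading of the output, again as an assumption rather than the paper's rule: a top-level split in which both subtrees are large would suggest that the method reports two clusters in unimodal data, whereas a split with one very small subtree looks more like an outlier being separated.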
