Paper Title

Selecting the Number of Clusters $K$ with a Stability Trade-off: an Internal Validation Criterion

Authors

Alex Mourer, Florent Forest, Mustapha Lebbah, Hanane Azzag, Jérôme Lacaille

Abstract

Model selection is a major challenge in non-parametric clustering. There is no universally accepted way to evaluate clustering results, for the obvious reason that no ground truth is available. The difficulty of finding a universal evaluation criterion is a consequence of the ill-defined objective of clustering. From this perspective, clustering stability has emerged as a natural and model-agnostic principle: an algorithm should find stable structures in the data. If data sets are repeatedly sampled from the same underlying distribution, an algorithm should find similar partitions. However, stability alone is not well suited to determining the number of clusters. For instance, it is unable to detect whether the number of clusters is too small. We propose a new principle: a good clustering should be stable, and within each cluster, there should exist no stable partition. This principle leads to a novel clustering validation criterion based on between-cluster and within-cluster stability, overcoming limitations of previous stability-based methods. We empirically demonstrate the effectiveness of our criterion in selecting the number of clusters and compare it with existing methods. Code is available at https://github.com/FlorentF9/skstab.
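
The principle stated in the abstract can be illustrated with a small, self-contained sketch. This is not the paper's exact criterion and it does not use the skstab API. All function names below are hypothetical, and several choices are assumptions made only for illustration: k-means with farthest-first initialisation as the clustering algorithm, subsampling as the perturbation, the adjusted Rand index as the similarity between partitions, only two-way splits when probing within-cluster stability, and an arbitrary minimum cluster size of 10.

```python
import math
import random
from collections import Counter

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def kmeans(points, k, rng, iters=50):
    """Lloyd's algorithm with farthest-first initialisation; returns the centers."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            # Keep the old center if a cluster becomes empty.
            new_centers.append(tuple(sum(x) / len(members) for x in zip(*members))
                               if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def adjusted_rand(a, b):
    """Adjusted Rand index between two labelings of the same points."""
    pairs = lambda counts: sum(math.comb(c, 2) for c in counts)
    index = pairs(Counter(zip(a, b)).values())
    sum_a, sum_b = pairs(Counter(a).values()), pairs(Counter(b).values())
    expected = sum_a * sum_b / math.comb(len(a), 2)
    maximum = (sum_a + sum_b) / 2
    return 1.0 if maximum == expected else (index - expected) / (maximum - expected)

def stability(points, k, rng, n_runs=10, frac=0.8):
    """Mean agreement between a reference partition and partitions fitted on subsamples."""
    reference = assign(points, kmeans(points, k, rng))
    scores = []
    for _ in range(n_runs):
        sample = rng.sample(points, int(frac * len(points)))
        # Extend the subsample clustering to all points via nearest center.
        labels = assign(points, kmeans(sample, k, rng))
        scores.append(adjusted_rand(reference, labels))
    return sum(scores) / n_runs

def tradeoff_score(points, k, rng, min_size=10):
    """Between-cluster stability minus the highest within-cluster stability (simplified)."""
    between = stability(points, k, rng)
    labels = assign(points, kmeans(points, k, rng))
    within = 0.0
    for j in range(k):
        cluster = [p for p, l in zip(points, labels) if l == j]
        if len(cluster) >= min_size:  # assumption: ignore very small clusters
            within = max(within, stability(cluster, 2, rng))
    return between - within

# Toy data: two nearby blobs and one distant blob (three true clusters).
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (6.0, 0.0), (30.0, 0.0)]
points = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
          for cx, cy in blob_centers for _ in range(30)]

for k in (2, 3, 4):
    print(k, round(stability(points, k, rng), 3), round(tradeoff_score(points, k, rng), 3))
```

On this toy data, K = 2 is perfectly stable because the two nearby blobs are always merged, so stability alone cannot flag it as too small. The simplified score penalises it because the merged cluster itself contains a stable two-way partition.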
