Paper Title
An Internal Cluster Validity Index Using a Distance-based Separability Measure
Paper Authors
Paper Abstract
Evaluating clustering results is an important part of cluster analysis. In typical unsupervised learning, no true class labels are available for clustering. Many internal evaluation measures, which use only the data and the predicted labels, have therefore been created; they are known as internal cluster validity indices (CVIs). Without true labels, designing an effective CVI is not simple, because the task is similar to designing a clustering method. Having more CVIs is nonetheless crucial: no universal CVI can be used to measure all datasets, and there is no specific method for selecting a proper CVI for clusters without true labels. Applying several CVIs to evaluate clustering results is therefore necessary. In this paper, we propose a novel CVI, the Distance-based Separability Index (DSI), which is based on a data separability measure. For comparison, we applied the DSI and eight other internal CVIs, ranging from early studies such as Dunn (1974) to recent ones such as CVDD (2019). We used an external CVI as ground truth to assess the clustering results of five clustering algorithms on 12 real and 97 synthetic datasets. The results show that DSI is an effective, unique, and competitive CVI relative to the other compared CVIs. In addition, we summarize the general process for evaluating CVIs and introduce a new method, rank difference, for comparing the results of CVIs.
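To make concrete what an internal CVI computes from the data and predicted labels alone, here is a minimal sketch of the classic Dunn index, the earliest baseline named in the abstract. It is not the DSI proposed in the paper, whose definition the abstract does not give. The function name `dunn_index`, the toy points, and the two labelings are illustrative assumptions, and the sketch uses the common formulation: smallest distance between points in different clusters divided by the largest within-cluster diameter.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(points, labels):
    """Dunn index: min between-cluster distance / max within-cluster diameter.

    Higher values suggest compact, well-separated clusters.
    """
    # Group points by their predicted cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Largest distance between two points that share a cluster.
    max_diameter = max(
        (dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points in different clusters.
    min_separation = min(
        dist(a, b)
        for first, second in combinations(clusters.values(), 2)
        for a in first
        for b in second
    )
    if max_diameter == 0.0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter


# Hypothetical toy data: two visibly separate groups of three points.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

good_score = dunn_index(points, good_labels)
bad_score = dunn_index(points, bad_labels)
print(good_score > bad_score)
```

No true labels enter the computation, which is what makes the index internal. In an evaluation like the one the abstract describes, such scores would then be checked against an external CVI that does use true labels.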