Paper Title
Evaluating and Validating Cluster Results
Paper Authors
Paper Abstract
Clustering is a technique for partitioning data according to their characteristics, so that data that are similar in nature belong to the same cluster [1]. Clustering quality can be evaluated in two ways: external evaluation, in which the true labels of the data set are known in advance, and internal evaluation, which uses only the data set itself, without true labels. In this paper, both external and internal evaluation are performed on clustering results for the Iris dataset. For external evaluation, Homogeneity, Completeness, and V-measure scores are calculated. For internal evaluation, the Silhouette Index and the Sum of Squared Errors are used. These internal measures, together with the dendrogram (a graphical tool from hierarchical clustering), are first used to validate the number of clusters. Finally, as a statistical tool, the frequency distribution method is used to compare, and visually represent, the distribution of observations in a clustering result and in the original data.
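The measures named in the abstract can be illustrated with a minimal sketch. The code below is not the paper's implementation (the abstract does not say which tooling was used); it uses the standard textbook definitions, and all function names (`entropy`, `conditional_entropy`, `homogeneity_completeness_v`, `sse`, `silhouette`) are hypothetical names chosen for this example. It assumes Euclidean distance, natural logarithms, and, for the silhouette, at least two clusters.

```python
from collections import Counter
from math import dist, log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equal-length label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity_completeness_v(true_labels, cluster_labels):
    """External measures: they need the true labels."""
    h_true = entropy(true_labels)
    h_clus = entropy(cluster_labels)
    hom = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    com = 1.0 if h_clus == 0 else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_clus
    # V-measure is the harmonic mean of homogeneity and completeness.
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v


def sse(points, clusters):
    """Internal measure: sum of squared distances to each cluster centroid."""
    total = 0.0
    for k in set(clusters):
        members = [p for p, c in zip(points, clusters) if c == k]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total


def silhouette(points, clusters):
    """Internal measure: mean silhouette coefficient over all points.

    Assumes at least two clusters; points in singleton clusters score 0.
    """
    clusters = list(clusters)
    scores = []
    for i, (p, k) in enumerate(zip(points, clusters)):
        own = [dist(p, q) for j, (q, c) in enumerate(zip(points, clusters)) if c == k and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for q, c in zip(points, clusters) if c == other) / clusters.count(other)
            for other in set(clusters) if other != k
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data, not the Iris dataset: two well-separated pairs of points.
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(homogeneity_completeness_v([0, 0, 1, 1], [1, 1, 0, 0]))
    print(sse(pts, [0, 0, 1, 1]), silhouette(pts, [0, 0, 1, 1]))
```

The external scores compare a clustering against known labels and do not depend on how the cluster ids are numbered, while SSE and the silhouette use only the points and their cluster assignments, which is why they can also be used to choose the number of clusters.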