Paper title
A new distance measurement and its application in K-Means Algorithm
Paper authors
Abstract
K-Means is one of the most commonly used clustering algorithms because of its simplicity and efficiency. K-Means based on Euclidean distance considers only the straight-line distance between samples and ignores the overall distribution structure of the dataset (i.e., its manifold structure). Because it is difficult to describe the internal structure between two data points with Euclidean distance in a high-dimensional data space, we propose a new distance measurement, view-distance, and apply it to the K-Means algorithm. On the classical manifold learning datasets S-curve and Swiss roll, the new distance not only clusters the data according to the structure of the data itself, but also yields clean dividing lines between categories. We also tested the classification accuracy and clustering quality of K-Means based on view-distance on several real-world datasets. The experimental results show that, on most datasets, K-Means based on view-distance improves both classification accuracy and clustering quality to some degree.
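The abstract describes replacing the distance used by K-Means but does not define view-distance, so the following is only a minimal sketch of where such a distance would plug in. It assumes a Lloyd-style loop; the names `kmeans`, `euclidean`, and the `distance` parameter are hypothetical, and Euclidean distance stands in as a placeholder for the paper's measure.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Distance = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    """Baseline straight-line distance used by standard K-Means."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points: List[Point], k: int, distance: Distance = euclidean,
           max_iter: int = 100, seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Lloyd-style K-Means whose assignment step uses `distance`.

    Assumption: the update step keeps the coordinate-wise mean, which is
    only guaranteed optimal for squared Euclidean distance. The paper may
    update centers differently for view-distance.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest center under the chosen distance.
        new_labels = [min(range(k), key=lambda j: distance(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels
        # Update step: mean of each cluster; an empty cluster keeps its center.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers


# Two well-separated toy groups; any callable (a, b) -> float can replace
# `euclidean`, which is where a view-distance implementation would go.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans(data, k=2, distance=euclidean)
```

Passing the distance as a callable keeps the clustering loop unchanged when the measure is swapped, which makes it easy to compare a Euclidean baseline against an alternative on the same data.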