Clustering of Big Data with Mixed Features

Authors

Joshua Tobin, Mimi Zhang

Abstract

Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means, and hence are sensitive to initialisation, detect only spherical clusters, and require a priori the unknown number of clusters. We here develop a new clustering algorithm for large data of mixed type, aiming at improving the applicability and efficiency of the peak-finding technique. The improvements are threefold: (1) the new algorithm is applicable to mixed data; (2) the algorithm is capable of detecting outliers and clusters of relatively lower density values; (3) the algorithm is competent at deciding the correct number of clusters. The computational complexity of the algorithm is greatly reduced by applying a fast k-nearest neighbors method and by scaling down to component sets. We present experimental results to verify that our algorithm works well in practice.

Keywords: Clustering; Big Data; Mixed Attribute; Density Peaks; Nearest-Neighbor Graph; Conductance.
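
For readers unfamiliar with the peak-finding technique the abstract refers to, the following is a minimal sketch of generic density-peak clustering with a k-nearest-neighbour density estimate. It is not the authors' algorithm: it handles numeric features only, uses a brute-force distance matrix rather than a fast k-nearest-neighbour method, and requires the number of clusters as input. The function name `density_peaks` and the parameters `k` and `n_clusters` are illustrative assumptions.

```python
import numpy as np


def density_peaks(X, k=5, n_clusters=2):
    """Generic density-peak clustering sketch (numeric data only).

    Hypothetical helper for illustration; not the method of the paper.
    """
    n = X.shape[0]
    # Brute-force pairwise Euclidean distances: O(n^2), toy-sized data only.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # kNN density estimate: inverse distance to the k-th nearest neighbour.
    # After sorting each row, column 0 is the point itself (distance 0).
    kth = np.sort(dist, axis=1)[:, k]
    rho = 1.0 / (kth + 1e-12)

    # Visit points from densest to sparsest.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)          # distance to nearest denser point
    parent = np.full(n, -1)      # index of that nearest denser point
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbour.
            delta[i] = dist[i].max()
            continue
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]
        parent[i] = j

    # Peaks are points that are both dense and far from any denser point.
    score = rho * delta
    score[order[0]] = np.inf     # the densest point is always a peak
    peaks = np.argsort(-score)[:n_clusters]

    labels = np.full(n, -1)
    labels[peaks] = np.arange(n_clusters)
    # Every non-peak inherits the label of its nearest denser point,
    # which has already been labelled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels
```

On two well-separated Gaussian blobs, this sketch would be expected to place each blob in its own cluster. Choosing the number of clusters automatically, handling mixed attributes, and detecting outliers are the contributions the abstract claims for the paper's algorithm and are not attempted here.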
