Paper Title

Clustering of Big Data with Mixed Features

Authors

Joshua Tobin, Mimi Zhang

Abstract

Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means, and hence are sensitive to initialisation, detect only spherical clusters, and require the unknown number of clusters a priori. Here we develop a new clustering algorithm for large data of mixed type, aiming at improving the applicability and efficiency of the peak-finding technique. The improvements are threefold: (1) the new algorithm is applicable to mixed data; (2) the algorithm is capable of detecting outliers and clusters of relatively lower density values; (3) the algorithm is competent at deciding the correct number of clusters. The computational complexity of the algorithm is greatly reduced by applying a fast k-nearest neighbors method and by scaling down to component sets. We present experimental results to verify that our algorithm works well in practice.

Keywords: Clustering; Big Data; Mixed Attribute; Density Peaks; Nearest-Neighbor Graph; Conductance.
