Paper title
The effect of measurement error on clustering algorithms
Paper authors
Paper abstract
Clustering comprises a popular set of techniques for separating data into interesting groups for further analysis. Many data sources on which clustering is performed are well known to contain random and systematic measurement errors. Such errors may adversely affect clustering. While several techniques have been developed to deal with this problem, little is known about the effectiveness of these solutions. Moreover, no work to date has examined the effect of systematic errors on clustering solutions. In this paper, we perform a Monte Carlo study to investigate the sensitivity of two common clustering algorithms, Gaussian mixture models (GMMs) with component merging and DBSCAN, to random and systematic error. We find that measurement error is particularly problematic when it is systematic and when it affects all variables in the dataset. For the conditions considered here, we also find that the partition-based GMM with merged components is less sensitive to measurement error than the density-based DBSCAN procedure.