Paper title

An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering

Authors

Hennig, Christian; Coretto, Pietro

Abstract

We introduce a new approach to deciding the number of clusters. The approach is applied to Optimally Tuned Robust Improper Maximum Likelihood Estimation (OTRIMLE; Coretto and Hennig 2016) of a Gaussian mixture model allowing for observations to be classified as "noise", but it can be applied to other clustering methods as well. The quality of a clustering is assessed by a statistic $Q$ that measures how close the within-cluster distributions are to elliptical unimodal distributions that have the only mode in the mean. This nonparametric measure allows for non-Gaussian clusters as long as they have a good quality according to $Q$. The simplicity of a model is assessed by a measure $S$ that prefers a smaller number of clusters unless additional clusters can reduce the estimated noise proportion substantially. The simplest model is then chosen that is adequate for the data in the sense that its observed value of $Q$ is not significantly larger than what is expected for data truly generated from the fitted model, as can be assessed by parametric bootstrap. The approach is compared with model-based clustering using the Bayesian Information Criterion (BIC) and the Integrated Complete Likelihood (ICL) in a simulation study and on two datasets of scientific interest.

Keywords: parametric bootstrap; noise component; unimodality; model-based clustering
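The selection rule described in the abstract (take the simplest model whose observed $Q$ is not significantly larger than its parametric bootstrap distribution) can be illustrated with a small one-dimensional sketch. Everything in it is an assumption made for illustration and not the authors' implementation: `fit` is a toy stand-in for OTRIMLE that cuts the sorted data into equal-size blocks, `q_stat` is a hypothetical Kolmogorov-distance substitute for the paper's $Q$, simplicity is reduced to "fewer clusters first" in place of the measure $S$, and there is no noise component.

```python
import random
from statistics import NormalDist, mean, pstdev


def split(data, k):
    """Toy clustering (assumption): cut the sorted data into k equal-size blocks."""
    xs = sorted(data)
    n = len(xs)
    return [xs[i * n // k:(i + 1) * n // k] for i in range(k)]


def fit(data, k):
    """Toy stand-in for a mixture fit: one Gaussian per block, weighted by block size."""
    n = len(data)
    return [(len(b) / n, NormalDist(mean(b), pstdev(b))) for b in split(data, k)]


def q_stat(data, k):
    """Hypothetical Q: size-weighted Kolmogorov distance between each block's
    empirical CDF and its fitted Gaussian (not the statistic from the paper)."""
    n = len(data)
    total = 0.0
    for block in split(data, k):
        dist = NormalDist(mean(block), pstdev(block))
        m = len(block)
        ks = 0.0
        for i, x in enumerate(block):
            c = dist.cdf(x)
            ks = max(ks, (i + 1) / m - c, c - i / m)
        total += m / n * ks
    return total


def simulate(model, n, rng):
    """Draw n points from the fitted mixture (the parametric bootstrap sample)."""
    weights = [w for w, _ in model]
    comps = [d for _, d in model]
    picks = rng.choices(comps, weights=weights, k=n)
    return [rng.gauss(d.mean, d.stdev) for d in picks]


def choose_k(data, k_max=3, n_boot=99, alpha=0.05, seed=0):
    """Return the smallest k whose observed Q is not significantly too large."""
    rng = random.Random(seed)
    for k in range(1, k_max + 1):  # simplest candidate first
        q_obs = q_stat(data, k)
        model = fit(data, k)
        q_boot = [q_stat(simulate(model, len(data), rng), k) for _ in range(n_boot)]
        # One-sided bootstrap p-value for "observed Q is larger than expected"
        p = (1 + sum(q >= q_obs for q in q_boot)) / (n_boot + 1)
        if p > alpha:
            return k, p
    return None, None  # no candidate was adequate


# Two well-separated groups built from exact Gaussian quantiles
data = [d.inv_cdf((i + 0.5) / 50)
        for d in (NormalDist(-5, 1), NormalDist(5, 1))
        for i in range(50)]
k, p = choose_k(data)
print(k, p)
```

On this toy data the single-Gaussian candidate is rejected, because its observed distance is far larger than any bootstrap replicate, and the two-cluster candidate is accepted as the simplest adequate model. A faithful version would replace `fit` with an OTRIMLE fit that includes the noise component and would order candidates by the paper's simplicity measure $S$.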
