Paper title
Bayesian nonparametric mixture inconsistency for the number of components: How worried should we be in practice?
Paper authors
Paper abstract
We consider the Bayesian mixture of finite mixtures (MFMs) and Dirichlet process mixture (DPM) models for clustering. Recent asymptotic theory has established that DPMs overestimate the number of clusters for large samples and that estimators from both classes of models are inconsistent for the number of clusters under misspecification, but the implications for finite sample analyses are unclear. The final reported estimate after fitting these models is often a single representative clustering obtained using an MCMC summarisation technique, but it is unknown how well such a summary estimates the number of clusters. Here we investigate these practical considerations through simulations and an application to gene expression data, and find that (i) DPMs overestimate the number of clusters even in finite samples, but only to a limited degree that may be correctable using appropriate summaries, and (ii) misspecification can lead to considerable overestimation of the number of clusters in both DPMs and MFMs, but results are nevertheless often still interpretable. We provide recommendations on MCMC summarisation and suggest that although the more appealing asymptotic properties of MFMs provide strong motivation to prefer them, results obtained using MFMs and DPMs are often very similar in practice.
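As a rough illustration of why DPMs can favour more clusters as the sample grows, the sketch below looks at the Chinese restaurant process (CRP) prior that underlies a DPM. Under this prior, the expected number of clusters among n observations is the sum of alpha / (alpha + i) for i = 0, ..., n - 1, which grows roughly like alpha * log(n). This concerns the prior only, not the posterior behaviour studied in the paper, and the function names and the default concentration parameter `alpha = 1.0` are assumptions made for this example.

```python
import random


def expected_crp_clusters(n, alpha=1.0):
    """Prior expected number of clusters for n items under a CRP(alpha).

    Item i (0-based) opens a new cluster with probability alpha / (alpha + i),
    so the expectation is the sum of those probabilities.
    """
    return sum(alpha / (alpha + i) for i in range(n))


def simulate_crp_clusters(n, alpha=1.0, seed=0):
    """Draw one partition from a CRP(alpha) and return its number of clusters."""
    rng = random.Random(seed)
    counts = []  # counts[k] = current size of cluster k
    for i in range(n):
        # With i items already seated, open a new cluster w.p. alpha / (alpha + i).
        if rng.random() < alpha / (alpha + i):
            counts.append(1)
        else:
            # Otherwise join an existing cluster with probability proportional to its size.
            k = rng.choices(range(len(counts)), weights=counts)[0]
            counts[k] += 1
    return len(counts)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(expected_crp_clusters(n), 2), simulate_crp_clusters(n))
```

The prior expectation keeps increasing with n and never levels off, which is consistent with the intuition that a DPM does not target a fixed, finite number of components. How strongly this carries over to the posterior, and to a single summary clustering, is what the simulations in the paper examine.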