Paper Title


Model-based Clustering using Automatic Differentiation: Confronting Misspecification and High-Dimensional Data

Authors

Kasa, Siva Rajesh, Rajan, Vaibhav

Abstract


We study two practically important cases of model-based clustering using Gaussian Mixture Models: (1) when there is misspecification and (2) on high-dimensional data, in the light of recent advances in Gradient Descent (GD) based optimization using Automatic Differentiation (AD). Our simulation studies show that EM has better clustering performance, measured by Adjusted Rand Index, compared to GD in cases of misspecification, whereas on high-dimensional data GD outperforms EM. We observe that both with EM and GD there are many solutions with high likelihood but poor cluster interpretation. To address this problem we design a new penalty term for the likelihood based on the Kullback-Leibler divergence between pairs of fitted components. Closed-form expressions for the gradients of this penalized likelihood are difficult to derive, but AD can be done effortlessly, illustrating the advantage of AD-based optimization. Extensions of this penalty for high-dimensional data and for model selection are discussed. Numerical experiments on synthetic and real datasets demonstrate the efficacy of clustering using the proposed penalized likelihood approach.
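The penalty described above is built from Kullback-Leibler divergences between pairs of fitted Gaussian components, for which a closed-form expression exists. The sketch below shows that closed form and a hypothetical pairwise penalty; the exact way the paper combines the pairwise terms with the likelihood is not specified in the abstract, so `kl_penalty` and its weight `lam` are illustrative assumptions, not the authors' formula. In an AD framework the gradients of such a term would be obtained automatically.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form KL divergence KL(N(mu0, cov0) || N(mu1, cov1))
    between two multivariate Gaussians of dimension d."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (
        np.trace(cov1_inv @ cov0)          # tr(Sigma1^-1 Sigma0)
        + diff @ cov1_inv @ diff           # Mahalanobis term
        - d
        + np.log(np.linalg.det(cov1) / np.linalg.det(cov0))
    )

def kl_penalty(mus, covs, lam=1.0):
    """Hypothetical penalty term: sum of KL divergences over all
    ordered pairs of distinct fitted components, scaled by lam.
    (Illustrative only; the paper's exact combination may differ.)"""
    k = len(mus)
    total = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:
                total += gaussian_kl(mus[i], covs[i], mus[j], covs[j])
    return lam * total
```

For identical components every pairwise KL term vanishes, so the penalty is zero; well-separated components contribute large terms, which is the property the penalized likelihood exploits to discourage degenerate, heavily overlapping solutions.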
