A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data

Title

A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data

Authors

Florian Mouret, Alexandre Hippert-Ferrer, Frédéric Pascal, Jean-Yves Tourneret

Abstract

This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputation by chained equations. However, Gaussian mixture models are known to be non-robust to heterogeneous data, which can lead to poor estimation performance when the data are contaminated by outliers or follow non-Gaussian distributions. To overcome this issue, a new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data. This paper shows that this problem reduces to the estimation of a mixture of Angular Gaussian distributions under generic assumptions (i.e., each sample is drawn from a mixture of elliptical distributions, which may differ from one sample to another). In that case, the complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework with missing data thanks to its conditional distribution, which is shown to be a multivariate $t$-distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.
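
To make the imputation idea concrete, here is a minimal sketch of the conditional-mean step that EM-based imputation with a Gaussian component typically relies on. It is not the algorithm proposed in the paper: it handles a single component with known parameters, and the function name `conditional_mean_impute`, the NaN encoding of missing entries, and the toy values of `mu` and `sigma` are all assumptions made for illustration.

```python
import numpy as np


def conditional_mean_impute(x, mu, sigma):
    """Fill NaN entries of x with their conditional mean given the observed ones.

    Hypothetical helper: assumes a single component with known mean `mu`
    and covariance `sigma`, and that missing values are encoded as NaN.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    miss = np.isnan(x)   # mask of missing coordinates
    obs = ~miss          # mask of observed coordinates

    if not miss.any():
        return x.copy()  # nothing to impute
    if not obs.any():
        return mu.copy() # nothing observed: fall back to the mean

    sigma_oo = sigma[np.ix_(obs, obs)]    # observed/observed block
    sigma_mo = sigma[np.ix_(miss, obs)]   # missing/observed block

    # Conditional mean: mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)
    correction = sigma_mo @ np.linalg.solve(sigma_oo, x[obs] - mu[obs])

    filled = x.copy()
    filled[miss] = mu[miss] + correction
    return filled


# Toy example (values chosen only for illustration).
mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
x = np.array([2.0, np.nan])
x_filled = conditional_mean_impute(x, mu, sigma)
```

In a mixture setting, one would usually compute such a conditional mean per component and combine them using the posterior component probabilities; that step, and the robust parameter updates described in the abstract, are left out of this sketch.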
