Paper Title

A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data

Authors

Florian Mouret, Alexandre Hippert-Ferrer, Frédéric Pascal, Jean-Yves Tourneret

Abstract

This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputations by chained equations. However, Gaussian mixture models are known to be non-robust to heterogeneous data, which can lead to poor estimation performance when the data is contaminated by outliers or follows non-Gaussian distributions. To overcome this issue, a new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data. This paper shows that this problem reduces to the estimation of a mixture of Angular Gaussian distributions under generic assumptions (i.e., each sample is drawn from a mixture of elliptical distributions, which may differ from one sample to another). In that case, the complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework with missing data thanks to its conditional distribution, which is shown to be a multivariate $t$-distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.
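To make the imputation step concrete, here is a minimal sketch of conditional-mean imputation for a single component with known parameters. It is not the algorithm proposed in the paper: it omits the mixture, the EM parameter updates, and the robust scatter estimation. It relies only on the standard fact that, for Gaussian and multivariate $t$ models alike, the conditional mean of the missing entries given the observed ones is $\mu_m + \Sigma_{mo}\Sigma_{oo}^{-1}(x_o - \mu_o)$. The function name `conditional_mean_impute` and the convention of marking missing values with NaN are assumptions made for illustration.

```python
import numpy as np


def conditional_mean_impute(x, mu, sigma):
    """Fill NaN entries of x with their conditional mean given the observed entries.

    Hypothetical helper for illustration: assumes a single component with
    known location `mu` and scatter matrix `sigma`.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    miss = np.isnan(x)   # mask of missing coordinates
    obs = ~miss          # mask of observed coordinates
    filled = x.copy()

    if not miss.any():
        return filled            # nothing to impute
    if not obs.any():
        filled[miss] = mu[miss]  # no information: fall back to the location
        return filled

    sigma_oo = sigma[np.ix_(obs, obs)]   # observed/observed block
    sigma_mo = sigma[np.ix_(miss, obs)]  # missing/observed block

    # Solve sigma_oo z = (x_o - mu_o) instead of forming an explicit inverse.
    z = np.linalg.solve(sigma_oo, x[obs] - mu[obs])
    filled[miss] = mu[miss] + sigma_mo @ z
    return filled


if __name__ == "__main__":
    mu = np.zeros(2)
    sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
    print(conditional_mean_impute([1.0, np.nan], mu, sigma))
```

In a mixture setting, one would typically combine such per-component conditional means using the component posterior probabilities computed from the observed coordinates; that weighting is left out here to keep the sketch short.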
