Paper Title
Robust Generalised Quadratic Discriminant Analysis
Paper Authors
Paper Abstract
Quadratic discriminant analysis (QDA) is a widely used statistical tool for classifying observations from different multivariate Normal populations. The generalized quadratic discriminant analysis (GQDA) classifier generalizes the QDA and minimum Mahalanobis distance (MMD) classifiers to discriminate between populations with underlying elliptically symmetric distributions. It competes quite favorably with the QDA classifier when the latter is optimal, and performs much better when QDA fails under non-Normal underlying distributions, e.g. the Cauchy distribution. However, the GQDA classification rule is based on the sample mean vector and the sample dispersion matrix of a training sample, which are extremely non-robust under data contamination. Since real-world data are often highly vulnerable to outliers, the lack of robustness of these classical estimators significantly reduces the efficiency of the GQDA classifier and increases the misclassification error. The present paper investigates the performance of the GQDA classifier when the classical estimators of the mean vector and the dispersion matrix are replaced by various robust counterparts. Applications to various real data sets, as well as simulation studies, reveal far better performance of the proposed robust versions of the GQDA classifier. A comparative study is made to recommend an appropriate choice of robust estimators for a given degree of contamination of the data.
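The idea of swapping classical estimators for robust ones can be illustrated with a minimal univariate sketch. Everything in it is an assumption made for illustration and is not taken from the abstract: the decision rule is written with a tuning constant `c` such that `c = 1` recovers QDA with equal priors and `c = 0` recovers MMD, the scalar variance stands in for the dispersion matrix, and the median with a rescaled median absolute deviation (MAD) is only one possible robust estimator, not necessarily one of those the paper studies. The function names and the toy data are hypothetical.

```python
import math
import statistics


def classical_estimates(sample):
    """Sample mean and sample variance (the non-robust baseline)."""
    return statistics.mean(sample), statistics.variance(sample)


def robust_estimates(sample):
    """Median and squared rescaled MAD (an illustrative robust choice)."""
    med = statistics.median(sample)
    mad = statistics.median([abs(x - med) for x in sample])
    scale = 1.4826 * mad  # makes the MAD consistent for Normal data
    return med, scale ** 2


def gqda_classify(x, est1, est2, c):
    """Return 1 or 2 for a scalar observation x.

    Assumed rule (univariate form): choose class 1 when
    d1 - d2 < c * log(var2 / var1), where d_i is the squared
    standardized distance from x to the location of class i.
    c = 1 is QDA with equal priors, c = 0 is MMD.
    """
    (m1, v1), (m2, v2) = est1, est2
    d1 = (x - m1) ** 2 / v1
    d2 = (x - m2) ** 2 / v2
    return 1 if d1 - d2 < c * math.log(v2 / v1) else 2


# Hypothetical training data: class 1 is centred near 0 and contains
# one gross outlier (50.0); class 2 is centred near 5 and is clean.
class1 = [-1.0, -0.5, 0.0, 0.5, 1.0, 50.0]
class2 = [4.0, 4.5, 5.0, 5.5, 6.0]

x_new = 4.5  # lies inside the clean class-2 sample
for name, estimator in [("classical", classical_estimates),
                        ("robust", robust_estimates)]:
    e1, e2 = estimator(class1), estimator(class2)
    for c in (0.0, 1.0):
        label = gqda_classify(x_new, e1, e2, c)
        print(f"{name:9s} c={c:.0f} -> class {label}")
```

With these toy numbers the single outlier inflates the classical variance of class 1 so much that the classical rule with `c = 0` assigns `x_new = 4.5` to class 1, whereas the median/MAD version assigns it to class 2. The multivariate case would replace the scalar variance with a robust dispersion matrix and the variance ratio with a ratio of determinants.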