Paper title
A sub-sampling algorithm preventing outliers
Paper authors
Paper abstract
Massive datasets are now available in many fields, and for several reasons it can be convenient to analyze only a subset of the data. The D-optimality criterion can help to select a subsample of observations optimally. However, it is well known that D-optimal support points lie on the boundary of the design space; if they coincide with extreme response values, they can severely influence the estimated linear model (high-influence leverage points). To overcome this problem, we first propose an unsupervised exchange procedure that selects a nearly D-optimal subset of observations without high leverage values. We then provide a supervised version of this exchange procedure which, besides high leverage points, also avoids outliers in the responses that are not associated with high leverage points. This is possible because, unlike in other design situations, the response values may be available when subsampling from big datasets. Finally, both the unsupervised and the supervised selection procedures are generalized to I-optimality, with the goal of obtaining accurate predictions.
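To make the unsupervised idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a linear model with design matrix `X`, measures D-optimality by the log-determinant of the information matrix X'X of the subsample, and excludes rows whose leverage (diagonal of the hat matrix of the full data) exceeds a user-chosen cap. The function names, the `leverage_cap` parameter, and the simple coordinate-wise exchange rule are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np


def leverages(X):
    """Leverage values: diagonal of the hat matrix X (X'X)^{-1} X'."""
    # With a reduced QR factorization X = QR, the hat matrix is QQ',
    # so each leverage is the squared norm of the matching row of Q.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)


def log_det_information(X_sub):
    """Log-determinant of X_sub' X_sub (the D-optimality objective)."""
    sign, logdet = np.linalg.slogdet(X_sub.T @ X_sub)
    return logdet if sign > 0 else -np.inf


def exchange_subsample(X, n, leverage_cap, max_sweeps=10, seed=0):
    """Hypothetical exchange heuristic (an assumption, not the paper's rule).

    Starts from a random subsample of size n drawn from the rows whose
    leverage is at most leverage_cap, then swaps rows in and out whenever
    a swap increases the log-determinant.
    """
    rng = np.random.default_rng(seed)
    h = leverages(X)
    allowed = np.flatnonzero(h <= leverage_cap)
    if allowed.size < n:
        raise ValueError("leverage_cap leaves fewer than n candidate rows")

    chosen = rng.choice(allowed, size=n, replace=False)
    best = log_det_information(X[chosen])

    for _ in range(max_sweeps):
        improved = False
        for pos in range(n):
            for cand in allowed:
                if cand in chosen:  # keep the subsample free of duplicates
                    continue
                trial = chosen.copy()
                trial[pos] = cand
                value = log_det_information(X[trial])
                if value > best + 1e-10:
                    chosen, best, improved = trial, value, True
        if not improved:  # no swap helped during a full sweep
            break
    return chosen, best
```

A call such as `exchange_subsample(X, n=10, leverage_cap=np.quantile(leverages(X), 0.9))` would return the selected row indices together with the attained log-determinant. The 0.9 quantile is an arbitrary illustrative threshold. A supervised variant would additionally screen candidates on their response values, which this sketch does not attempt.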