Paper Title

Surprise sampling: improving and extending the local case-control sampling

Authors

Shen, Xinwei; Chen, Kani; Yu, Wen

Abstract

Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling, which achieves stability and efficiency through a clever adjustment pertaining to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on the working principle that data points deserve a higher sampling probability if they contain more information or appear "surprising" in the sense of, for example, a large error of the pilot prediction or a large absolute score. Compared with the relevant existing sampling schemes reported in Fithian and Hastie (2014) and Ai et al. (2018), the proposed scheme has several advantages. It adaptively yields the optimal forms for a variety of objectives, and it includes the LCC sampling and the sampling of Ai et al. (2018) as special cases. Under the same model specifications, the proposed estimator performs no worse than those in the literature. The estimation procedure is valid even if the model is misspecified and/or the pilot estimator is inconsistent or dependent on the full data. We present theoretical justifications of the claimed advantages and of the optimality of the estimation and the sampling design. Unlike that of Ai et al. (2018), our large-sample theory is population-wise rather than data-wise. Moreover, the proposed approach can be applied to unsupervised learning studies, since it essentially requires only a specific loss function and no response-covariate structure of the data. Numerical studies are carried out, and the evidence they provide supports the theory.
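
For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of LCC subsampling as it is usually described for Fithian and Hastie (2014): fit a pilot logistic model, keep each point with probability equal to the absolute pilot prediction error, fit a logistic model on the kept points, and add the pilot coefficients back as the adjustment. It does not implement the surprise sampling scheme of this paper. The helper names (`fit_logistic`, `lcc_estimate`), the simulated data, the pilot sample size, and the use of a plain Newton solver are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Newton-Raphson logistic regression (hypothetical helper).

    X is assumed to already contain an intercept column.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        # Tiny ridge term only keeps the linear solve numerically safe.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta


def lcc_estimate(X, y, theta_pilot, rng):
    """LCC sketch: accept (x, y) with probability |y - p_pilot(x)|."""
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)      # large for "surprising" points
    keep = rng.random(len(y)) < accept_prob
    theta_sub = fit_logistic(X[keep], y[keep])
    # Adjustment step: add the pilot coefficients to the subsample fit.
    return theta_pilot + theta_sub, keep


# Simulated imbalanced data (an assumption, not data from the paper).
rng = np.random.default_rng(0)
n = 100_000
theta_true = np.array([-4.0, 1.0, -1.0])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Pilot fit on a small uniform subsample (size chosen arbitrarily here).
pilot_idx = rng.choice(n, size=5_000, replace=False)
theta_pilot = fit_logistic(X[pilot_idx], y[pilot_idx])

theta_lcc, keep = lcc_estimate(X, y, theta_pilot, rng)
print("fraction of data kept:", keep.mean())
print("LCC estimate:", theta_lcc)
```

In this sketch most of the abundant, easily predicted negatives are discarded, so the final fit runs on a small fraction of the data. The surprise sampling scheme described in the abstract generalizes the acceptance rule beyond the absolute pilot error used here.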
