Paper title
Logistic Regression for Massive Data with Rare Events
Paper authors
Abstract
This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events (observations in one class, often called cases) is significantly smaller than the number of nonevents (observations in the other class, often called controls). We first derive the asymptotic distribution of the maximum likelihood estimator (MLE) of the unknown parameter, which shows that the asymptotic variance converges to zero at the rate of the inverse of the number of events rather than the inverse of the full data sample size. This indicates that the available information in rare events data is at the scale of the number of events, not the full data sample size. Furthermore, we prove that when only a small proportion of the nonevents is retained by under-sampling, the resulting under-sampled estimator may have the same asymptotic distribution as the full data MLE. This demonstrates the advantage of under-sampling nonevents for rare events data, because this procedure may significantly reduce the computation and/or data collection costs. Another common practice in analyzing rare events data is to over-sample (replicate) the events, which has a higher computational cost. We show that this procedure may even result in efficiency loss in terms of parameter estimation.
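The under-sampling idea in the abstract can be illustrated with a small simulation. The sketch below is not the paper's estimator or experiment. It assumes a single covariate, hypothetical true coefficients (-6, 1), a hypothetical nonevent sampling rate of 0.05, and the standard case-control intercept correction (adding the log of the sampling rate), none of which are specified in the abstract. It fits a logistic regression by Newton's method on the full data and on a subsample that keeps every event but only a fraction of the nonevents, so the two sets of estimates can be compared.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Newton-Raphson MLE for logistic regression.

    Assumes the first column of X is the intercept column of ones.
    """
    beta = np.zeros(X.shape[1])
    # Start the intercept at the marginal log-odds, which helps when events are rare.
    beta[0] = np.log(y.mean() / (1.0 - y.mean()))
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)                           # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X    # observed information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(0)

# Hypothetical rare-events data: the large negative intercept makes events rare.
n = 200_000
true_beta = np.array([-6.0, 1.0])
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
prob = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = (rng.random(n) < prob).astype(float)

# Full-data MLE.
beta_full = fit_logistic(X, y)

# Keep all events, and each nonevent independently with probability `rate`.
rate = 0.05  # assumed sampling rate, chosen only for illustration
keep = (y == 1) | (rng.random(n) < rate)
beta_sub = fit_logistic(X[keep], y[keep])

# Under-sampling nonevents inflates the odds by 1/rate, so shift the intercept back.
beta_sub[0] += np.log(rate)

print("events:", int(y.sum()), "of", n)
print("subsample size:", int(keep.sum()))
print("full-data estimate:    ", beta_full)
print("under-sampled estimate:", beta_sub)
```

In a run like this the subsample is a small fraction of the full data, yet both fits are driven mainly by the same set of events, which is the intuition behind the abstract's claim that information scales with the number of events.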