Paper Title

Optimal Subsampling for Large Sample Ridge Regression

Authors

Yunlu Chen and Nan Zhang

Abstract


Subsampling is a popular approach to alleviating the computational burden of analyzing massive datasets. Recent efforts have been devoted to various statistical models without explicit regularization. In this paper, we develop an efficient subsampling procedure for large-sample linear ridge regression. In contrast to the ordinary least squares estimator, the introduction of the ridge penalty leads to a subtle trade-off between bias and variance. We first investigate the asymptotic properties of the subsampling estimator and then propose to minimize an asymptotic mean squared error criterion for optimality. The resulting subsampling probabilities involve both the ridge leverage scores and the L2 norms of the predictors. To further reduce the cost of computing the ridge leverage scores, we propose an algorithm with an efficient approximation. We show on synthetic and real datasets that the algorithm is both statistically accurate and computationally efficient compared with existing subsampling-based methods.
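The abstract's key ingredients can be illustrated in a short sketch. The ridge leverage score of observation i is the standard quantity tau_i = x_i^T (X^T X + lam*I)^{-1} x_i; the sampling weights below simply combine these scores with the predictors' L2 norms as an illustrative mixture, not the paper's exact optimal probability formula, and the function names are hypothetical:

```python
import numpy as np

def ridge_leverage_scores(X, lam):
    """tau_i = x_i^T (X^T X + lam I)^{-1} x_i for each row x_i of X."""
    d = X.shape[1]
    G = X.T @ X + lam * np.eye(d)
    # Solve instead of forming an explicit inverse for numerical stability.
    Z = np.linalg.solve(G, X.T)          # shape (d, n)
    return np.einsum("ij,ji->i", X, Z)   # diagonal of X G^{-1} X^T

def subsampled_ridge(X, y, r, lam, seed=None):
    """Draw r rows with probabilities mixing ridge leverage scores and
    predictor L2 norms (illustrative weighting), then fit a weighted
    ridge estimate on the subsample."""
    rng = np.random.default_rng(seed)
    tau = ridge_leverage_scores(X, lam)
    norms = np.linalg.norm(X, axis=1)
    p = tau / tau.sum() + norms / norms.sum()
    p /= p.sum()
    idx = rng.choice(len(y), size=r, replace=True, p=p)
    w = 1.0 / (r * p[idx])               # inverse-probability weights
    sw = np.sqrt(w)
    Xs, ys = X[idx] * sw[:, None], y[idx] * sw
    d = X.shape[1]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ ys)
```

With lam > 0 every ridge leverage score lies strictly between 0 and 1, and the subsampled estimator returns a d-dimensional coefficient vector computed from only r of the n rows.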
