论文标题

大规模最小二乘估计的核心元素

Core-Elements for Large-Scale Least Squares Estimation

论文作者

Li, Mengyu, Yu, Jun, Li, Tao, Meng, Cheng

论文摘要

核心方法的方法(也称为子采样或子集选择)旨在选择一个子样本作为观察到的样本的替代物,并在大规模数据分析中发现了广泛的应用。现有核心方法使用预测矩阵中的行子集构建子集。当预测矩阵稀疏或数值稀疏时,此类方法效率显着效率低下。为了克服这一局限性,我们开发了一种新型元素的子集选择方法,称为核心元素,以进行大规模最小二乘估计。 We provide a deterministic algorithm to construct the core-elements estimator, only requiring an $O(\mathrm{nnz}(X)+rp^2)$ computational cost, where $X$ is an $n\times p$ predictor matrix, $r$ is the number of elements selected from each column of $X$, and $\mathrm{nnz}(\cdot)$ denotes the number of非零元素。从理论上讲,我们表明所提出的估计器是公正的,并且近似估计方差的上限。我们还通过得出针对所提出的估计器结合的核心样品样品来提供近似保证。为了处理数据中的潜在离群值,我们将核心元素与均值过程相结合,从而产生了具有理论一致性保证的有效且健壮的估计器。关于各种合成和现实世界数据集的数值研究表明,与主流竞争者相比,该方法的表现出色。

The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample and has found extensive applications in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an $O(\mathrm{nnz}(X)+rp^2)$ computational cost, where $X$ is an $n\times p$ predictor matrix, $r$ is the number of elements selected from each column of $X$, and $\mathrm{nnz}(\cdot)$ denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method's superior performance compared to mainstream competitors.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源