Distributed Estimation for Principal Component Analysis: an Enlarged Eigenspace Analysis

Paper title

Distributed Estimation for Principal Component Analysis: an Enlarged Eigenspace Analysis

Authors

Xi Chen, Jason D. Lee, He Li, Yun Yang

Abstract

The growing size of modern data sets brings many challenges to existing statistical estimation approaches and calls for new distributed methodologies. This paper studies distributed estimation for a fundamental statistical machine learning problem, principal component analysis (PCA). Despite the massive literature on top eigenvector estimation, much less is available on top-$L$-dim ($L>1$) eigenspace estimation, especially in a distributed manner. We propose a novel multi-round algorithm for constructing the top-$L$-dim eigenspace for distributed data. Our algorithm takes advantage of shift-and-invert preconditioning and convex optimization. Our estimator is communication-efficient and achieves a fast convergence rate. In contrast to the existing divide-and-conquer algorithm, our approach has no restriction on the number of machines. Theoretically, the traditional Davis-Kahan theorem requires an explicit eigengap assumption to estimate the top-$L$-dim eigenspace. To drop this eigengap assumption, we consider a new route in our analysis: instead of exactly identifying the top-$L$-dim eigenspace, we show that our estimator is able to cover the targeted top-$L$-dim population eigenspace. Our distributed algorithm can be applied to a wide range of statistical problems based on PCA, such as principal component regression and single index models. Finally, we provide simulation studies to demonstrate the performance of the proposed distributed estimator.
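
To make the shift-and-invert idea mentioned in the abstract concrete, here is a minimal sketch for the single top eigenvector only. It is not the paper's algorithm: the data, the names `shift_invert_top_eigvec`, `local_covs`, and `pooled_cov`, and the choice of shift are all assumptions for illustration. The linear system is solved directly on a pooled covariance, whereas the abstract describes a multi-round, communication-efficient scheme based on convex optimization.

```python
import numpy as np

def shift_invert_top_eigvec(cov, shift, n_iter=50, seed=0):
    """Power iteration on (shift * I - cov)^{-1}.

    Assumes shift is strictly larger than the top eigenvalue of cov,
    so the shifted matrix is positive definite and its inverse has the
    same top eigenvector as cov.
    """
    d = cov.shape[0]
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    shifted = shift * np.eye(d) - cov
    for _ in range(n_iter):
        # Each step solves a linear system instead of forming the inverse.
        w = np.linalg.solve(shifted, w)
        w /= np.linalg.norm(w)
    return w

# Hypothetical setup: simulated data split evenly across a few machines.
rng = np.random.default_rng(1)
d, n_per_machine, n_machines = 5, 2000, 4
scales = np.array([3.0, 1.0, 0.8, 0.5, 0.3])  # per-coordinate std. dev.
local_data = [rng.standard_normal((n_per_machine, d)) * scales
              for _ in range(n_machines)]

# Each machine forms its local covariance; averaging gives the pooled one.
local_covs = [x.T @ x / n_per_machine for x in local_data]
pooled_cov = sum(local_covs) / n_machines

# Assumption: the shift is set slightly above the top eigenvalue, computed
# exactly here for simplicity; in practice it would be a rough upper estimate.
shift = 1.1 * np.linalg.eigvalsh(pooled_cov)[-1]
w = shift_invert_top_eigvec(pooled_cov, shift)
```

The closer the shift is to the top eigenvalue, the larger the relative gap of the inverted matrix, which is why the iteration can converge quickly even when the original eigengap is small.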
