论文标题

通过全球支持解决稀疏主组件分析

Solving sparse principal component analysis with global support

论文作者

Dey, Santanu S., Molinaro, Marco, Wang, Guanyi

论文摘要

具有全球支持(SPCAGS)的稀疏主组件分析是找到顶部$ r $领先的主组件的问题,以便所有这些主要组件都是最多$ k $变量的常见子集的线性组合。 SPCAGS是统计数据中流行的缩小维度工具,与常规的主成分分析(PCA)相比,可增强可解释性。在文献中解决SPCAG的方法是贪婪的启发式方法(在$ r = 1 $的特殊情况下),并在限制性统计模型下保证,或者在具有固定点融合的算法下进行一些正规化的SPCAG重新重新进行重新汇总。至关重要的是,现有的计算方法都无法通过将其与双重界限进行比较来有效保证获得的解决方案的质量。 在这项工作中,我们首先提出了一个基于操作员规范的凸放松,该规范可证明在$ C_1 + C_1 + C_2 \ sqrt {\ log r} = o(\ sqrt {\ sqrt {\ log r})中,某些常数$ c_1,c_1,c_2 $。为了证明这一结果,我们使用了一种新型的随机稀疏程序,该过程使用Pietsch-Grothendieck分解定理,并且可能具有独立的兴趣。我们还提出了一个更简单的放松,它是二阶可表示的,并给出了可行区域的$(2 \ sqrt {r})$ - 近似值。 然后,我们提出了一个凸整数程序,该程序为SPCAG的最佳值提供了双重绑定。此外,它还具有最差的保证:它在原始最佳值的乘法/添加因子之内,而乘法因子为$ o(\ log r)$或$ o(r)$,具体取决于所使用的放松。 最后,我们进行计算实验,以表明我们的凸整数程序在合理的时间内提供了良好的上限,通常比天然基线要好得多。

Sparse principal component analysis with global support (SPCAgs), is the problem of finding the top-$r$ leading principal components such that all these principal components are linear combinations of a common subset of at most $k$ variables. SPCAgs is a popular dimension reduction tool in statistics that enhances interpretability compared to regular principal component analysis (PCA). Methods for solving SPCAgs in the literature are either greedy heuristics (in the special case of $r = 1$) with guarantees under restrictive statistical models or algorithms with stationary point convergence for some regularized reformulation of SPCAgs. Crucially, none of the existing computational methods can efficiently guarantee the quality of the solutions obtained by comparing them against dual bounds. In this work, we first propose a convex relaxation based on operator norms that provably approximates the feasible region of SPCAgs within a $c_1 + c_2 \sqrt{\log r} = O(\sqrt{\log r})$ factor for some constants $c_1, c_2$. To prove this result, we use a novel random sparsification procedure that uses the Pietsch-Grothendieck factorization theorem and may be of independent interest. We also propose a simpler relaxation that is second-order cone representable and gives a $(2\sqrt{r})$-approximation for the feasible region. Using these relaxations, we then propose a convex integer program that provides a dual bound for the optimal value of SPCAgs. Moreover, it also has worst-case guarantees: it is within a multiplicative/additive factor of the original optimal value, and the multiplicative factor is $O(\log r)$ or $O(r)$ depending on the relaxation used. Finally, we conduct computational experiments that show that our convex integer program provides, within a reasonable time, good upper bounds that are typically significantly better than the natural baselines.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源