Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality

Paper title

Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality

Authors

Dimitris Bertsimas, Ryan Cory-Wright, Jean Pauphilet

Abstract

Sparse principal component analysis (PCA) is a popular dimensionality reduction technique for obtaining principal components which are linear combinations of a small subset of the original features. Existing approaches cannot supply certifiably optimal principal components with more than $p = 100$s of variables. By reformulating sparse PCA as a convex mixed-integer semidefinite optimization problem, we design a cutting-plane method which solves the problem to certifiable optimality at the scale of selecting $k = 5$ covariates from $p = 300$ variables, and provides small bound gaps at a larger scale. We also propose a convex relaxation and greedy rounding scheme that provides bound gaps of $1$-$2\%$ in practice within minutes for $p = 100$s or hours for $p = 1{,}000$s and is therefore a viable alternative to the exact method at scale. Using real-world financial and medical datasets, we illustrate our approach's ability to derive interpretable principal components tractably at scale.
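
To make the problem in the abstract concrete, the following minimal sketch states sparse PCA as choosing $k$ of $p$ variables so that the leading eigenvalue of the corresponding $k \times k$ covariance submatrix is as large as possible. It is not the paper's cutting-plane method or its relaxation-and-rounding scheme: `sparse_pca_bruteforce` enumerates every support (feasible only for very small $p$), and `greedy_support` is a simple forward-selection heuristic included only to illustrate a bound gap. Both function names and the synthetic data are assumptions made for illustration.

```python
import itertools
import numpy as np


def sparse_pca_bruteforce(sigma, k):
    """Exact k-sparse leading component by enumerating all supports.

    Illustrative only: the cost grows as C(p, k), so this is usable
    for tiny p, unlike the scalable methods the abstract describes.
    """
    p = sigma.shape[0]
    best_val, best_support, best_vec = -np.inf, None, None
    for support in itertools.combinations(range(p), k):
        idx = list(support)
        sub = sigma[np.ix_(idx, idx)]
        vals, vecs = np.linalg.eigh(sub)  # eigenvalues in ascending order
        if vals[-1] > best_val:
            best_val, best_support = vals[-1], support
            best_vec = np.zeros(p)
            best_vec[idx] = vecs[:, -1]  # embed the unit eigenvector in R^p
    return best_val, best_support, best_vec


def greedy_support(sigma, k):
    """Forward selection: repeatedly add the index that most increases
    the leading eigenvalue of the selected submatrix (a heuristic)."""
    p = sigma.shape[0]
    chosen = []
    current_val = -np.inf
    for _ in range(k):
        best_j, best_val = None, -np.inf
        for j in range(p):
            if j in chosen:
                continue
            idx = chosen + [j]
            val = np.linalg.eigvalsh(sigma[np.ix_(idx, idx)])[-1]
            if val > best_val:
                best_j, best_val = j, val
        chosen.append(best_j)
        current_val = best_val
    return current_val, tuple(sorted(chosen))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((50, 8))        # synthetic data, p = 8
    sigma = data.T @ data / data.shape[0]      # sample covariance (uncentered)
    exact_val, support, x = sparse_pca_bruteforce(sigma, 3)
    heur_val, heur_support = greedy_support(sigma, 3)
    gap = (exact_val - heur_val) / exact_val   # relative gap of the heuristic
    print(support, heur_support, gap)
```

The heuristic value is always a lower bound on the exact value, so the printed gap is nonnegative; certifying such a gap without enumeration is what requires an upper bound from a relaxation, as the abstract describes.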
