论文标题
拉伸混合物的有效聚类:景观和最佳性
Efficient Clustering for Stretched Mixtures: Landscape and Optimality
论文作者
论文摘要
本文考虑了一个规范的聚类问题,其中一个人从两个椭圆形分布的平衡混合物中收到未标记的样品,并旨在分类器估计标签。包括PCA和K-均值在内的许多流行方法都要求混合物的单个组件有些球形,并且在拉伸时性能差。为了克服这个问题,我们提出了一个非凸线计划,以寻求仿射变换,以将数据转换为一维点云,该云集中在$ -1 $和1美元左右,然后集群变得容易。我们的理论贡献有两个折叠:(1)我们表明,当样本大小超过维度的一定恒定倍数时,非凸损失函数表现出理想的几何特性,(2)我们利用它来证明有效的一阶算法可以实现近距离的统计精确度,而无需良好的初始化。我们还提出了一种用于聚类的通用方法,可以灵活地选择特征转换和损失目标。
This paper considers a canonical clustering problem where one receives unlabeled samples drawn from a balanced mixture of two elliptical distributions and aims for a classifier to estimate the labels. Many popular methods including PCA and k-means require individual components of the mixture to be somewhat spherical, and perform poorly when they are stretched. To overcome this issue, we propose a non-convex program seeking for an affine transform to turn the data into a one-dimensional point cloud concentrating around $-1$ and $1$, after which clustering becomes easy. Our theoretical contributions are two-fold: (1) we show that the non-convex loss function exhibits desirable geometric properties when the sample size exceeds some constant multiple of the dimension, and (2) we leverage this to prove that an efficient first-order algorithm achieves near-optimal statistical precision without good initialization. We also propose a general methodology for clustering with flexible choices of feature transforms and loss objectives.