拉伸混合物的有效聚类：景观和最佳性

论文标题

拉伸混合物的有效聚类：景观和最佳性

Efficient Clustering for Stretched Mixtures: Landscape and Optimality

论文作者

Wang, Kaizheng, Yan, Yuling, Díaz, Mateo

论文摘要

本文考虑了一个规范的聚类问题，其中一个人从两个椭圆形分布的平衡混合物中收到未标记的样品，并旨在分类器估计标签。包括PCA和K-均值在内的许多流行方法都要求混合物的单个组件有些球形，并且在拉伸时性能差。为了克服这个问题，我们提出了一个非凸线计划，以寻求仿射变换，以将数据转换为一维点云，该云集中在$ -1 $和1美元左右，然后集群变得容易。我们的理论贡献有两个折叠：（1）我们表明，当样本大小超过维度的一定恒定倍数时，非凸损失函数表现出理想的几何特性，（2）我们利用它来证明有效的一阶算法可以实现近距离的统计精确度，而无需良好的初始化。我们还提出了一种用于聚类的通用方法，可以灵活地选择特征转换和损失目标。

This paper considers a canonical clustering problem where one receives unlabeled samples drawn from a balanced mixture of two elliptical distributions and aims for a classifier to estimate the labels. Many popular methods including PCA and k-means require individual components of the mixture to be somewhat spherical, and perform poorly when they are stretched. To overcome this issue, we propose a non-convex program seeking for an affine transform to turn the data into a one-dimensional point cloud concentrating around $-1$ and $1$, after which clustering becomes easy. Our theoretical contributions are two-fold: (1) we show that the non-convex loss function exhibits desirable geometric properties when the sample size exceeds some constant multiple of the dimension, and (2) we leverage this to prove that an efficient first-order algorithm achieves near-optimal statistical precision without good initialization. We also propose a general methodology for clustering with flexible choices of feature transforms and loss objectives.

下载PDF全文

下载文献需遵守相关版权规定

论文标题