Paper Title

List-Decodable Mean Estimation in Nearly-PCA Time

Paper Authors

Ilias Diakonikolas, Daniel M. Kane, Daniel Kongsgaard, Jerry Li, Kevin Tian

Paper Abstract

Traditionally, robust statistics has focused on designing estimators tolerant to a minority of contaminated data. Robust list-decodable learning focuses on the more challenging regime where only a minority $\frac 1 k$ fraction of the dataset is drawn from the distribution of interest, and no assumptions are made on the remaining data. We study the fundamental task of list-decodable mean estimation in high dimensions. Our main result is a new list-decodable mean estimation algorithm for bounded covariance distributions with optimal sample complexity and error rate, running in nearly-PCA time. Assuming the ground truth distribution on $\mathbb{R}^d$ has bounded covariance, our algorithm outputs a list of $O(k)$ candidate means, one of which is within distance $O(\sqrt{k})$ from the truth. Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset. We also show that a variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee. This runtime matches up to logarithmic factors the cost of performing a single $k$-PCA on the data, which is a natural bottleneck of known algorithms for (very) special cases of our problem, such as clustering well-separated mixtures. Prior to our work, the fastest list-decodable mean estimation algorithms had runtimes $\widetilde{O}(n^2 d k^2)$ and $\widetilde{O}(nd k^{\ge 6})$. Our approach builds on a novel soft downweighting method, $\mathsf{SIFT}$, which is arguably the simplest known polynomial-time mean estimation technique in the list-decodable learning setting. To develop our fast algorithms, we boost the computational cost of $\mathsf{SIFT}$ via a careful "win-win-win" analysis of an approximate Ky Fan matrix multiplicative weights procedure we develop, which we believe may be of independent interest.
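
The abstract only describes the approach at a high level, but the soft-downweighting idea behind $\mathsf{SIFT}$ can be illustrated concretely: maintain nonnegative weights over the $n$ points, examine the top eigenvector of the weighted covariance, and multiplicatively downweight points whose projections along that direction are far from the weighted mean, stopping once the weighted covariance looks bounded. The Python sketch below is an illustration under these assumptions only: the function name `soft_downweight_mean`, the stopping threshold, and the exact update rule are hypothetical and not taken from the paper; it uses an exact eigendecomposition rather than the paper's approximate Ky Fan matrix multiplicative weights procedure; and it returns a single candidate rather than the full list of $O(k)$ means, so it does not attain the stated $\widetilde{O}(ndk)$ runtime.

```python
import numpy as np

def soft_downweight_mean(X, k, max_rounds=50):
    """Illustrative soft-downweighting sketch (hypothetical; not the paper's SIFT).

    X : (n, d) array in which roughly an n/k fraction are inliers drawn from a
        distribution with bounded covariance.
    k : contamination parameter (inliers form a 1/k fraction of the data).
    Returns a single candidate mean; a list-decodable algorithm would run a
    procedure like this from several weight initializations / clusters to
    produce the O(k) candidates described in the abstract.
    """
    n, d = X.shape
    w = np.ones(n)                          # soft weights over the points
    mu = X.mean(axis=0)
    for _ in range(max_rounds):
        p = w / w.sum()
        mu = p @ X                          # weighted mean, shape (d,)
        Y = X - mu
        cov = (Y * p[:, None]).T @ Y        # weighted covariance, shape (d, d)
        eigvals, eigvecs = np.linalg.eigh(cov)
        lam, v = eigvals[-1], eigvecs[:, -1]
        if lam <= 10.0 * k:                 # hypothetical stopping rule: variance
            break                           # in every direction is already O(k)
        scores = (Y @ v) ** 2               # squared projections onto top direction
        if scores.max() == 0.0:
            break
        # Soft (multiplicative) downweighting: points far from mu along v
        # lose the most weight; the farthest point is zeroed out.
        w = w * (1.0 - scores / scores.max())
        if w.sum() <= 1e-12:
            break
    return mu
```

Each round of this naive sketch spends $O(nd^2 + d^3)$ time on the dense weighted covariance and its eigendecomposition; as the abstract indicates, replacing such exact spectral steps with the approximate Ky Fan matrix multiplicative weights procedure is what lets the paper's algorithm run in nearly-PCA time $\widetilde{O}(ndk)$.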
