Title


A Nearly Tight Analysis of Greedy k-means++

Authors

Grunau, Christoph, Özüdoğru, Ahmet Alper, Rozhoň, Václav, Tětek, Jakub

Abstract


The famous $k$-means++ algorithm of Arthur and Vassilvitskii [SODA 2007] is the most popular way of solving the $k$-means problem in practice. The algorithm is very simple: it samples the first center uniformly at random, and each of the following $k-1$ centers is then sampled with probability proportional to its squared distance to the closest center chosen so far. Afterward, Lloyd's iterative algorithm is run. The $k$-means++ algorithm is known to return a $\Theta(\log k)$-approximate solution in expectation. In their seminal work, Arthur and Vassilvitskii [SODA 2007] asked about the guarantees of its following \emph{greedy} variant: in every step, we sample $\ell$ candidate centers instead of one and then pick the one that minimizes the new cost. This is also how $k$-means++ is implemented in, e.g., the popular scikit-learn library [Pedregosa et al.; JMLR 2011]. We present nearly matching lower and upper bounds for greedy $k$-means++: we prove that it is an $O(\ell^3 \log^3 k)$-approximation algorithm. On the other hand, we prove a lower bound of $\Omega(\ell^3 \log^3 k / \log^2(\ell \log k))$. Previously, only an $\Omega(\ell \log k)$ lower bound was known [Bhattacharya, Eube, Röglin, Schmidt; ESA 2020], and there was no known upper bound.
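As a concrete illustration, the greedy seeding step described in the abstract can be sketched in a few lines of Python. This is an assumed, illustrative implementation (the function name and interface are not from the paper): each new center is chosen by drawing $\ell$ candidates via $D^2$ sampling and keeping the candidate that minimizes the resulting cost. Setting `ell=1` recovers standard $k$-means++ seeding; the subsequent Lloyd iterations are omitted.

```python
import random


def greedy_kmeans_pp_seeding(points, k, ell, seed=0):
    """Sketch of greedy k-means++ seeding.

    points: list of coordinate tuples; k: number of centers to pick;
    ell: number of candidates sampled per step (ell=1 gives plain k-means++).
    Returns the list of k chosen centers (a subset of the input points).
    """
    rng = random.Random(seed)

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # First center: sampled uniformly at random.
    centers = [rng.choice(points)]
    # cost[i] = squared distance of points[i] to its closest center so far.
    cost = [sqdist(p, centers[0]) for p in points]

    for _ in range(k - 1):
        if sum(cost) == 0:
            # Every point coincides with a center; any choice has zero cost.
            centers.append(rng.choice(points))
            continue
        # D^2 sampling: draw ell candidates proportional to current cost.
        candidates = rng.choices(points, weights=cost, k=ell)
        # Greedy choice: keep the candidate minimizing the new total cost.
        best_new_cost, best_center = None, None
        for c in candidates:
            new_cost = [min(cost[i], sqdist(points[i], c))
                        for i in range(len(points))]
            if best_new_cost is None or sum(new_cost) < sum(best_new_cost):
                best_new_cost, best_center = new_cost, c
        centers.append(best_center)
        cost = best_new_cost
    return centers
```

The candidate evaluation makes each seeding step cost $O(\ell n)$ distance computations instead of $O(n)$; the paper's result says this extra greediness can worsen (not improve) the worst-case approximation factor from $\Theta(\log k)$ to roughly $\ell^3 \log^3 k$.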
