Title


A Nearly Tight Analysis of Greedy k-means++

Authors

Grunau, Christoph, Özüdoğru, Ahmet Alper, Rozhoň, Václav, Tětek, Jakub

Abstract


The famous $k$-means++ algorithm of Arthur and Vassilvitskii [SODA 2007] is the most popular way of solving the $k$-means problem in practice. The algorithm is very simple: it samples the first center uniformly at random, and each of the following $k-1$ centers is then sampled with probability proportional to its squared distance to the closest center chosen so far. Afterward, Lloyd's iterative algorithm is run. The $k$-means++ algorithm is known to return a $\Theta(\log k)$-approximate solution in expectation. In their seminal work, Arthur and Vassilvitskii [SODA 2007] asked about the guarantees of its following \emph{greedy} variant: in every step, we sample $\ell$ candidate centers instead of one and then pick the one that minimizes the new cost. This is also how $k$-means++ is implemented in, e.g., the popular scikit-learn library [Pedregosa et al.; JMLR 2011]. We present nearly matching lower and upper bounds for greedy $k$-means++: we prove that it is an $O(\ell^3 \log^3 k)$-approximation algorithm. On the other hand, we prove a lower bound of $\Omega(\ell^3 \log^3 k / \log^2(\ell \log k))$. Previously, only an $\Omega(\ell \log k)$ lower bound was known [Bhattacharya, Eube, Röglin, Schmidt; ESA 2020], and there was no known upper bound.
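As a concrete illustration, the greedy seeding step described in the abstract can be sketched in a few lines of Python. This is an assumed, illustrative implementation (the function name and interface are not from the paper): each new center is chosen by drawing $\ell$ candidates via $D^2$ sampling and keeping the candidate that minimizes the resulting cost. Setting `ell=1` recovers standard $k$-means++ seeding; the subsequent Lloyd iterations are omitted.

```python
import random


def greedy_kmeans_pp_seeding(points, k, ell, seed=0):
    """Sketch of greedy k-means++ seeding.

    points: list of coordinate tuples; k: number of centers to pick;
    ell: number of candidates sampled per step (ell=1 gives plain k-means++).
    Returns the list of k chosen centers (a subset of the input points).
    """
    rng = random.Random(seed)

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # First center: sampled uniformly at random.
    centers = [rng.choice(points)]
    # cost[i] = squared distance of points[i] to its closest center so far.
    cost = [sqdist(p, centers[0]) for p in points]

    for _ in range(k - 1):
        if sum(cost) == 0:
            # Every point coincides with a center; any choice has zero cost.
            centers.append(rng.choice(points))
            continue
        # D^2 sampling: draw ell candidates proportional to current cost.
        candidates = rng.choices(points, weights=cost, k=ell)
        # Greedy choice: keep the candidate minimizing the new total cost.
        best_new_cost, best_center = None, None
        for c in candidates:
            new_cost = [min(cost[i], sqdist(points[i], c))
                        for i in range(len(points))]
            if best_new_cost is None or sum(new_cost) < sum(best_new_cost):
                best_new_cost, best_center = new_cost, c
        centers.append(best_center)
        cost = best_new_cost
    return centers
```

The candidate evaluation makes each seeding step cost $O(\ell n)$ distance computations instead of $O(n)$; the paper's result says this extra greediness can worsen (not improve) the worst-case approximation factor from $\Theta(\log k)$ to roughly $\ell^3 \log^3 k$.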
