论文标题

快速分布式K-均值少数回合

Fast Distributed k-Means with a Small Number of Rounds

论文作者

Hess, Tom, Visbord, Ron, Sabato, Sivan

论文摘要

我们为在分布式设置中提出了一种用于K-均值聚类的新算法,其中数据是在许多机器上分布的,并且协调器与这些计算机进行通信以计算输出群集。我们的算法保证了成本近似因素和仅取决于协调员的计算能力的许多通信回合。此外,该算法包括一种内置的停止机制,该机制可在可能的情况下使用更少的通信回合。我们从理论和经验上都表明,在许多自然情况下,确实足以1-4轮。与受欢迎的K-均值相比||算法,我们的方法允许利用更大的协调能力来获得较小数量的回合。我们的实验表明,即使允许后者有更多的回合,也是k均值||获得的K-均值的K-均值成本通常优于K-均值||的成本。此外,我们方法中的机器运行时间比K-均值||的机器运行时间要小得多。用于运行算法和实验的代码可在https://github.com/selotape/distributed_k_means上找到。

We propose a new algorithm for k-means clustering in a distributed setting, where the data is distributed across many machines, and a coordinator communicates with these machines to calculate the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show both theoretically and empirically that in many natural cases, indeed 1-4 rounds suffice. In comparison with the popular k-means|| algorithm, our approach allows exploiting a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means||, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means||. Code for running the algorithm and experiments is available at https://github.com/selotape/distributed_k_means.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源