Banditpam：几乎线性时间$ K $ -MEDOIDS通过多军匪徒聚类

论文标题

Banditpam：几乎线性时间$ K $ -MEDOIDS通过多军匪徒聚类

BanditPAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed Bandits

论文作者

Tiwari, Mo, Zhang, Martin Jinye, Mayclin, James, Thrun, Sebastian, Piech, Chris, Shomorony, Ilan

论文摘要

聚类是数据科学中无处不在的任务。与常用的$ k $ -Means聚类相比，$ k $ -Medoids聚类需要集群中心为实际数据点并支持任意距离指标，这允许更大的解释性和结构化对象的聚类。当前的最新$ K $ -MEDOIDS聚类算法，例如围绕Medoids（PAM）进行分区（PAM），是迭代性的，在数据集尺寸$ n $的每次迭代中都是二次的，对于大型数据集而言非常昂贵。我们提出了BanditPam，这是一种受多军匪徒技术启发的随机算法，将每种PAM迭代的复杂性从$ O（N^2）$降低到$ O（N \ log n \ log n）$，并以相同的结果返回相同的结果，并以很高的可能性返回，在实践中经常存在的数据下。因此，Banditpam匹配最新的聚类损失，同时更快地达到解决方案。我们从经验上验证了几个大型现实世界数据集的结果，包括编码练习提交数据集，10倍基因组学68K PBMC单细胞单细胞RNA测序数据集和MNIST手写数字数据集。在这些实验中，我们观察到Banditpam返回与最先进的PAM样算法相同的结果，同时执行的距离计算少于200倍。 Banditpam所展示的改进使$ K $ -MEDOIDS聚集在广泛的应用程序上，包括识别大型单细胞数据中的单元格类型，并为在线学习计算机科学的学生提供可扩展的反馈。我们还发布了算法的高度优化的Python和C ++实现。

Clustering is a ubiquitous task in data science. Compared to the commonly used $k$-means clustering, $k$-medoids clustering requires the cluster centers to be actual data points and support arbitrary distance metrics, which permits greater interpretability and the clustering of structured objects. Current state-of-the-art $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each iteration, being prohibitively expensive for large datasets. We propose BanditPAM, a randomized algorithm inspired by techniques from multi-armed bandits, that reduces the complexity of each PAM iteration from $O(n^2)$ to $O(n \log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice. As such, BanditPAM matches state-of-the-art clustering loss while reaching solutions much faster. We empirically validate our results on several large real-world datasets, including a coding exercise submissions dataset, the 10x Genomics 68k PBMC single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset. In these experiments, we observe that BanditPAM returns the same results as state-of-the-art PAM-like algorithms up to 4x faster while performing up to 200x fewer distance computations. The improvements demonstrated by BanditPAM enable $k$-medoids clustering on a wide range of applications, including identifying cell types in large-scale single-cell data and providing scalable feedback for students learning computer science online. We also release highly optimized Python and C++ implementations of our algorithm.

下载PDF全文

下载文献需遵守相关版权规定

论文标题