论文标题

部分可观测时空混沌系统的无模型预测

BETULA: Numerically Stable CF-Trees for BIRCH Clustering

论文作者

Lang, Andreas, Schubert, Erich

论文摘要

桦木聚类是一种众所周知的聚类方法,它影响了后来的研究和商业产品。桦木的关键贡献是聚类特征树(CF-Tree),它是输入数据的压缩表示。随着新数据的到来,该树最终被重建以增加压缩。之后,将树的叶子用于聚类。由于数据压缩,此方法非常可扩展。该想法已被采用,例如针对K-均值,数据流和基于密度的聚类的想法。 桦木使用的聚类特征是简单的摘要统计信息,可以轻松地使用新数据更新:点数,线性总和和平方值的总和。不幸的是,桦木中如何使用正方形的总和容易取消灾难性。 我们引入了一个没有这个数字问题的替代群集功能,维护并不是要昂贵得多,并且使许多计算变得更简单,因此更有效。这些群集功能也可以轻松地用于衍生自桦木的其他工作,例如用于流数据的算法。在实验中,我们证明了数值问题并比较原始算法的性能与改进的群集特征相比。

BIRCH clustering is a widely known approach for clustering, that has influenced much subsequent research and commercial products. The key contribution of BIRCH is the Clustering Feature tree (CF-Tree), which is a compressed representation of the input data. As new data arrives, the tree is eventually rebuilt to increase the compression. Afterward, the leaves of the tree are used for clustering. Because of the data compression, this method is very scalable. The idea has been adopted for example for k-means, data stream, and density-based clustering. Clustering features used by BIRCH are simple summary statistics that can easily be updated with new data: the number of points, the linear sums, and the sum of squared values. Unfortunately, how the sum of squares is then used in BIRCH is prone to catastrophic cancellation. We introduce a replacement cluster feature that does not have this numeric problem, that is not much more expensive to maintain, and which makes many computations simpler and hence more efficient. These cluster features can also easily be used in other work derived from BIRCH, such as algorithms for streaming data. In the experiments, we demonstrate the numerical problem and compare the performance of the original algorithm compared to the improved cluster features.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源