论文标题
基于索引的解决方案,用于有效密度峰值聚类
Index-based Solutions for Efficient Density Peak Clustering
论文作者
论文摘要
密度峰值聚类(DPC)是一种流行的基于密度的聚类方法,主要是由于其简单性和参数要求较少,因此受到了研究社区的极大关注。但是,使用DPC获得的结果群集受敏感参数$ D_C $的影响,该参数取决于数据分布和不同用户的要求。此外,原始的DPC算法需要访问大量对象,从而使其慢。为此,本文研究了基于索引的DPC解决方案。具体来说,我们建议两种基于列表的索引方法,即。 (i)简单列表索引,(ii)高级累积直方图索引。为这些指数提出了有效的查询算法,该算法显着避免了以空间成本而无关的比较。对于内存受限的系统,我们进一步引入了上述指数的近似解决方案,该指标可以大大降低空间成本,前提是轻微的不准确性是可以接受的。此外,由于现有基于树的索引结构的内存要求大大降低,我们还提出了有效的修剪技术和有效的查询算法,以使用流行的QuadTree索引和R-Tree索引支持DPC。最后,我们实际上评估了上述所有索引,并提出了从六个合成和实际数据集的一系列广泛实验获得的发现和结果。获得的实验见解可以帮助指导选择合适的指数。
Density Peak Clustering (DPC), a popular density-based clustering approach, has received considerable attention from the research community primarily due to its simplicity and fewer-parameter requirement. However, the resultant clusters obtained using DPC are influenced by the sensitive parameter $d_c$, which depends on data distribution and requirements of different users. Besides, the original DPC algorithm requires visiting a large number of objects, making it slow. To this end, this paper investigates index-based solutions for DPC. Specifically, we propose two list-based index methods viz. (i) a simple List Index, and (ii) an advanced Cumulative Histogram Index. Efficient query algorithms are proposed for these indices which significantly avoids irrelevant comparisons at the cost of space. For memory-constrained systems, we further introduce an approximate solution to the above indices which allows substantial reduction in the space cost, provided that slight inaccuracies are admissible. Furthermore, owing to considerably lower memory requirements of existing tree-based index structures, we also present effective pruning techniques and efficient query algorithms to support DPC using the popular Quadtree Index and R-tree Index. Finally, we practically evaluate all the above indices and present the findings and results, obtained from a set of extensive experiments on six synthetic and real datasets. The experimental insights obtained can help to guide in selecting a befitting index.