Paper title
Quantizing Multiple Sources to a Common Cluster Center: An Asymptotic Analysis
Paper authors
Abstract
We consider quantizing an $Ld$-dimensional sample, which is obtained by concatenating $L$ vectors from datasets of $d$-dimensional vectors, to a $d$-dimensional cluster center. The distortion measure is the weighted sum of the $r$th powers of the distances between the cluster center and the samples. For $L=1$, one recovers the ordinary center-based clustering formulation. The general case $L>1$ arises when one wishes to cluster a dataset through $L$ noisy observations of each of its members. We find a formula for the average distortion performance in the asymptotic regime where the number of cluster centers is large. We also provide an algorithm to numerically optimize the cluster centers, and we verify our analytical results on real and artificial datasets. In terms of faithfulness to the original (noiseless) dataset, our clustering approach outperforms the naive approach that relies on quantizing the $Ld$-dimensional noisy observation vectors to $Ld$-dimensional centers.
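The abstract does not spell out the optimization algorithm, so the following is only a minimal sketch of how the distortion measure and a center update could look. It assumes $r=2$ for the update step (where the best center of a cell is the weighted mean of all observations assigned to it) and uses a generalized Lloyd iteration as a stand-in for the authors' method. The function names `distortion` and `lloyd_common_center`, the array layout `(n, L, d)`, and the `weights` argument are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def distortion(samples, centers, weights, r=2.0):
    """Cost of quantizing each sample to each candidate center.

    samples: (n, L, d) array, L observations of d dimensions per sample
    centers: (K, d) array of d-dimensional cluster centers
    weights: (L,) array of non-negative weights (assumed, one per source)
    Returns an (n, K) array with sum_l weights[l] * ||x_l - c_k||^r.
    """
    weights = np.asarray(weights, dtype=float)
    diff = samples[:, :, None, :] - centers[None, None, :, :]  # (n, L, K, d)
    dist = np.linalg.norm(diff, axis=-1)                       # (n, L, K)
    return np.einsum("l,nlk->nk", weights, dist ** r)


def lloyd_common_center(samples, K, weights, iters=50, seed=0):
    """Generalized Lloyd iteration for r = 2 (an assumed, simplified setting)."""
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    n = samples.shape[0]
    # For r = 2 the weighted mean of the L observations is all a center needs.
    fused = np.einsum("l,nld->nd", weights / weights.sum(), samples)
    centers = fused[rng.choice(n, size=K, replace=False)].copy()
    for _ in range(iters):
        # Assignment step: nearest center under the weighted distortion.
        assign = distortion(samples, centers, weights, r=2.0).argmin(axis=1)
        # Update step: the minimizer of the r = 2 cost over each cell.
        for k in range(K):
            members = fused[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    cost = distortion(samples, centers, weights, r=2.0)
    return centers, cost.min(axis=1).mean()
```

With `L = 1` and a unit weight, this reduces to ordinary k-means on the $d$-dimensional data. For $r \neq 2$ the update step no longer has this closed form and would need a numerical minimizer instead.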