论文标题
设置聚类
Sets Clustering
论文作者
论文摘要
\ emph {sets- $ k $ -means}问题的输入是一个整数$ k \ geq 1 $和set $ \ mathcal {p} = \ {p_1,\ cdots,p_n \} $ in $ \ setbb {r} r}^d $。目标是在$ \ mathbb {r}^d $中计算$ k $中心的$ c $ c $ c $ c $ k $中心(点),以最大程度地减少sum $ \ sum_ {p \ in \ mathcal {p}}} \ min_ {p \ in c} P-C \ right \ |^2 $的平方距离。对于此问题,一个\ emph {$ \ varepsilon $ -core-set}是$ \ nathcal {p} $的加权子集,该子集近似于此总和$ 1 \ pm \ pm \ pm \ varepsilon $ action,对于\ emph {emph {every} set $ k $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ \ mathbbbbbbbb = $} $}我们证明,这种核心设置为$ o(\ log^2 {n})$ sets始终存在,并且可以以$ o(n \ log {n})$时间计算,对于每个输入$ \ Mathcal {p} $和每个已固定的$ d,k \ egeq 1 $和$ \ varepsilon \ in(0,1,1)$。结果很容易在任何度量空间,$ z> 0 $的功率的距离以及处理异常值的M-估计器中进行概括。在此核心上应用效率低下但最佳的算法使我们能够获得第一个PTA($ 1+\ varepsilon $近似),用于sets-$ k $ -means问题,该问题在$ n $中的线性接近。即使对于飞机上的集合均值,这也是第一个结果($ k = 1 $,$ d = 2 $)。还提供了用于文档分类和设施位置的开源代码和实验结果。
The input to the \emph{sets-$k$-means} problem is an integer $k\geq 1$ and a set $\mathcal{P}=\{P_1,\cdots,P_n\}$ of sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\| p-c \right\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to $1\pm\varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. The result easily generalized for any metric space, distances to the power of $z>0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility locations are also provided.