设置聚类

论文标题

Sets Clustering

论文作者

Jubran, Ibrahim, Tukan, Murad, Maalouf, Alaa, Feldman, Dan

论文摘要

\ emph {sets- $ k $ -means}问题的输入是一个整数$ k \ geq 1 $和set $ \ mathcal {p} = \ {p_1，\ cdots，p_n \} $ in $ \ setbb {r} r}^d $。目标是在$ \ mathbb {r}^d $中计算$ k $中心的$ c $ c $ c $ c $ k $中心（点），以最大程度地减少sum $ \ sum_ {p \ in \ mathcal {p}}} \ min_ {p \ in c} P-C \ right \ |^2 $的平方距离。对于此问题，一个\ emph {$ \ varepsilon $ -core-set}是$ \ nathcal {p} $的加权子集，该子集近似于此总和$ 1 \ pm \ pm \ pm \ varepsilon $ action，对于\ emph {emph {every} set $ k $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ c $ \ mathbbbbbbbb = $} $}我们证明，这种核心设置为$ o（\ log^2 {n}）$ sets始终存在，并且可以以$ o（n \ log {n}）$时间计算，对于每个输入$ \ Mathcal {p} $和每个已固定的$ d，k \ egeq 1 $和$ \ varepsilon \ in（0,1,1）$。结果很容易在任何度量空间，$ z> 0 $的功率的距离以及处理异常值的M-估计器中进行概括。在此核心上应用效率低下但最佳的算法使我们能够获得第一个PTA（$ 1+\ varepsilon $近似），用于sets-$ k $ -means问题，该问题在$ n $中的线性接近。即使对于飞机上的集合均值，这也是第一个结果（$ k = 1 $，$ d = 2 $）。还提供了用于文档分类和设施位置的开源代码和实验结果。

The input to the \emph{sets-$k$-means} problem is an integer $k\geq 1$ and a set $\mathcal{P}=\{P_1,\cdots,P_n\}$ of sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\| p-c \right\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to $1\pm\varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. The result easily generalized for any metric space, distances to the power of $z>0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility locations are also provided.

下载PDF全文

下载文献需遵守相关版权规定

论文标题

设置聚类

Sets Clustering

论文作者

论文摘要

加入微信交流群