论文标题

$ k $ neart邻居问题的有效分布式算法问题

Efficient Distributed Algorithms for the $K$-Nearest Neighbors Problem

论文作者

Fathi, Reza, Molla, Anisur Rahaman, Pandurangan, Gopal

论文摘要

$ k $ - 最近的邻居是机器学习的基本问题。在此问题中,给定标签和查询点$ p $的(培训)$ n $数据点,我们希望根据$ k $ neart-neart点的标签将标签分配给$ p $。我们在{\ em $ k $ -machine模型}中研究此问题(请注意,参数$ k $代表$ k $ - 摩碱模型中的机器数量,并且独立于$ k $ neart的点。)用于分布式大规模数据的模型。在此型号中,我们假设$ n $点是在$ k $机器中分布(以平衡的方式),目标是在对机器的查询点上快速计算答案。 我们的主要结果是在$ k $ - 机器模型中的一种简单的随机算法,该算法在$ o(\ log k)$通信弹中运行,并具有很高的可能性(无论机器的数量$ k $和点$ n $的数量如何)。该算法的消息复杂性仅用于$ O(k \ log k)$消息。我们的界限本​​质上是基于比较的算法(仅使用比较操作($ \ leq,\ geq,= $)的算法(算法),以区分它们之间的顺序)。这是由于存在$ω(\ log n)$通信弹的下限,以查找$ 2N $元素的{\ em中位数},该$ 2N $元素均匀分布在两个处理器中,由Rodeh \ cite {rodeh}在两个处理器中分布。 我们还实施了算法,并表明它的性能与算法(在实践中使用)相比,该算法将最接近每台机器的$ K $发送到一台机器,然后计算答案。

The $K$-nearest neighbors is a basic problem in machine learning with numerous applications. In this problem, given a (training) set of $n$ data points with labels and a query point $p$, we want to assign a label to $p$ based on the labels of the $K$-nearest points to the query. We study this problem in the {\em $k$-machine model}, (Note that parameter $k$ stands for the number of machines in the $k$-machine model and is independent of $K$-nearest points.) a model for distributed large-scale data. In this model, we assume that the $n$ points are distributed (in a balanced fashion) among the $k$ machines and the goal is to quickly compute answer given a query point to a machine. Our main result is a simple randomized algorithm in the $k$-machine model that runs in $O(\log K)$ communication rounds with high probability success (regardless of the number of machines $k$ and the number of points $n$). The message complexity of the algorithm is small taking only $O(k\log K)$ messages. Our bounds are essentially the best possible for comparison-based algorithms (Algorithms that use only comparison operations ($\leq, \geq, =$) between elements to distinguish the ordering among them). This is due to the existence of a lower bound of $Ω(\log n)$ communication rounds for finding the {\em median} of $2n$ elements distributed evenly among two processors by Rodeh \cite{rodeh}. We also implemented our algorithm and show that it performs well compared to an algorithm (used in practice) that sends $K$ nearest points from each machine to a single machine which then computes the answer.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源