Paper Title

Kernel Density Estimation through Density Constrained Near Neighbor Search

Paper Authors

Moses Charikar, Michael Kapralov, Navid Nouri, Paris Siminelakis

Paper Abstract

In this paper we revisit the kernel density estimation problem: given a kernel $K(x, y)$ and a dataset of $n$ points in high dimensional Euclidean space, prepare a data structure that can quickly output, given a query $q$, a $(1+ε)$-approximation to $μ:=\frac1{|P|}\sum_{p\in P} K(p, q)$. First, we give a single data structure based on classical near neighbor search techniques that improves upon or essentially matches the query time and space complexity for all radial kernels considered in the literature so far. We then show how to improve both the query complexity and runtime by using recent advances in data-dependent near neighbor search. We achieve our results by giving a new implementation of the natural importance sampling scheme. Unlike previous approaches, our algorithm first samples the dataset uniformly (considering a geometric sequence of sampling rates), and then uses existing approximate near neighbor search techniques on the resulting smaller dataset to retrieve the sampled points that lie at an appropriate distance from the query. We show that the resulting sampled dataset has strong geometric structure, making approximate near neighbor search return the required samples much more efficiently than for worst case datasets of the same size. As an example application, we show that this approach yields a data structure that achieves query time $μ^{-(1+o(1))/4}$ and space complexity $μ^{-(1+o(1))}$ for the Gaussian kernel. Our data-dependent approach achieves query time $μ^{-0.173-o(1)}$ and space $μ^{-(1+o(1))}$ for the Gaussian kernel. The data-dependent analysis relies on new techniques for tracking the geometric structure of the input datasets in a recursive hashing process that we hope will be of interest in other applications in near neighbor search.
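The sampling scheme described in the abstract can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's data structure: the geometric sequence of uniform sampling rates and the inverse-rate reweighting follow the abstract, but the retrieval step uses a brute-force distance filter in place of (data-dependent) approximate near neighbor search, and the level bands, the truncation threshold `mu_lower`, and the function name are hypothetical choices made for this sketch.

```python
import numpy as np

def gaussian_kde_sample_estimate(P, q, mu_lower, rng):
    """One estimate of mu = (1/n) * sum_p exp(-||p - q||^2).

    Level j subsamples the dataset uniformly at rate 2^-j and keeps the
    sampled points whose kernel value falls in (2^-(j+1), 2^-j]; each
    survivor is reweighted by the inverse sampling rate, which makes the
    estimate unbiased up to truncation.  Contributions below roughly
    mu_lower are dropped, costing at most ~mu_lower additive error.
    """
    n = len(P)
    levels = int(np.ceil(np.log2(1.0 / mu_lower)))
    est = 0.0
    for j in range(levels + 1):
        rate = 2.0 ** (-j)
        sample = P[rng.random(n) < rate]          # uniform subsample
        if len(sample) == 0:
            continue
        k = np.exp(-np.sum((sample - q) ** 2, axis=1))
        # The paper retrieves these points with approximate near neighbor
        # search on the subsample; brute force stands in for it here.
        in_band = (k > rate / 2.0) & (k <= rate)
        est += k[in_band].sum() / (n * rate)
    return est

rng = np.random.default_rng(0)
P = rng.normal(size=(5000, 8))                    # synthetic dataset
q = np.zeros(8)                                   # query point
reps = [gaussian_kde_sample_estimate(P, q, 1e-4, rng) for _ in range(200)]
print(np.mean(reps))
```

Averaging (or taking a median-of-means over) independent repetitions drives the variance down to a $(1+ε)$-approximation; the paper's contribution is that the retrieval step, implemented with near neighbor search on the subsample, exploits the geometric structure of the sampled dataset and is far cheaper than scanning the data.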
