Intrinsic Dimensionality Estimation within Tight Localities: A Theoretical and Experimental Analysis

Paper Title

Intrinsic Dimensionality Estimation within Tight Localities: A Theoretical and Experimental Analysis

Authors

Amsaleg, Laurent; Chelly, Oussama; Houle, Michael E.; Kawarabayashi, Ken-ichi; Radovanović, Miloš; Treeratanajaru, Weeris

Abstract

Accurate estimation of Intrinsic Dimensionality (ID) is of crucial importance in many data mining and machine learning tasks, including dimensionality reduction, outlier detection, similarity search and subspace clustering. However, since their convergence generally requires sample sizes (that is, neighborhood sizes) on the order of hundreds of points, existing ID estimation methods may have only limited usefulness for applications in which the data consists of many natural groups of small size. In this paper, we propose a local ID estimation strategy that is stable even for 'tight' localities consisting of as few as 20 sample points. The estimator applies MLE techniques over all available pairwise distances among the members of the sample, based on a recent extreme-value-theoretic model of intrinsic dimensionality, the Local Intrinsic Dimension (LID). Our experimental results show that our proposed estimation technique can achieve notably smaller variance, while maintaining comparable levels of bias, at much smaller sample sizes than state-of-the-art estimators.
