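An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation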

Paper title

An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation

Authors

Jingyi Zhang, Cheng Meng, Jun Yu, Mengrui Zhang, Wenxuan Zhong, Ping Ma

Abstract

Subsampling methods aim to select a subsample as a surrogate for the observed sample. Such methods have been used pervasively in large-scale data analytics, active learning, and privacy-preserving analysis in recent decades. Instead of model-based methods, in this paper, we study model-free subsampling methods, which aim to identify a subsample that is not confined by model assumptions. Existing model-free subsampling methods are usually built upon clustering techniques or kernel tricks. Most of these methods suffer from either a large computational burden or a theoretical weakness. In particular, the theoretical weakness is that the empirical distribution of the selected subsample may not necessarily converge to the population distribution. Such computational and theoretical limitations hinder the broad applicability of model-free subsampling methods in practice. We propose a novel model-free subsampling method by utilizing optimal transport techniques. Moreover, we develop an efficient subsampling algorithm that is adaptive to the unknown probability density function. Theoretically, we show the selected subsample can be used for efficient density estimation by deriving the convergence rate for the proposed subsample kernel density estimator. We also provide the optimal bandwidth for the proposed estimator. Numerical studies on synthetic and real-world datasets demonstrate the performance of the proposed method is superior.
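To make the idea of a subsample kernel density estimator concrete, the following is a minimal one-dimensional sketch. It is not the algorithm proposed in the paper: the selection rule `quantile_subsample` is a hypothetical stand-in that picks observed points at evenly spaced quantile levels, and the bandwidth is the generic normal-reference rule of thumb rather than the optimal bandwidth the authors derive. All function names here are assumptions made for illustration.

```python
import numpy as np


def gaussian_kde(points, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate built on `points` at `grid`."""
    z = (grid[:, None] - points[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z**2).sum(axis=1)
    return kernel_sum / (len(points) * bandwidth * np.sqrt(2 * np.pi))


def quantile_subsample(sample, r):
    """Pick r observed points at evenly spaced quantile levels.

    Illustrative stand-in only; NOT the paper's optimal-transport selection.
    """
    ordered = np.sort(sample)
    levels = (np.arange(r) + 0.5) / r          # midpoints of r equal-mass blocks
    idx = np.floor(levels * len(ordered)).astype(int)
    return ordered[idx]


def normal_reference_bandwidth(points):
    """Generic rule-of-thumb bandwidth for a Gaussian kernel (an assumption)."""
    return 1.06 * points.std(ddof=1) * len(points) ** (-1 / 5)


rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)               # synthetic full sample
subsample = quantile_subsample(sample, r=200)  # 200 representative points

grid = np.linspace(-4.0, 4.0, 201)
density_sub = gaussian_kde(subsample, grid, normal_reference_bandwidth(subsample))
```

The estimator is evaluated using only the 200 selected points, so each evaluation costs time proportional to the subsample size rather than to the full sample size. How well the result tracks the true density depends on the selection rule and the bandwidth, which is the question the paper addresses theoretically.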
