Paper Title

One Step to Efficient Synthetic Data

Authors

Jordan Awan, Zhanrui Cai

Abstract

A common approach to synthetic data is to sample from a fitted model. We show that under general assumptions, this approach results in a sample with inefficient estimators and whose joint distribution is inconsistent with the true distribution. Motivated by this, we propose a general method of producing synthetic data, which is widely applicable for parametric models, has asymptotically efficient summary statistics, and is both easily implemented and highly computationally efficient. Our approach allows for the construction of both partially synthetic datasets, which preserve certain summary statistics, as well as fully synthetic data which satisfy the strong guarantee of differential privacy (DP), both with the same asymptotic guarantees. We also provide theoretical and empirical evidence that the distribution from our procedure converges to the true distribution. Besides our focus on synthetic data, our procedure can also be used to perform approximate hypothesis tests in the presence of intractable likelihood functions.
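The first claim of the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure. It assumes a normal model with known variance (an illustrative choice, not taken from the paper), fits the mean on real data, samples a synthetic dataset from the fitted model, and compares the sampling variance of the mean computed on real versus synthetic data. Under this assumption the synthetic-data estimator carries the fitting noise plus the resampling noise, so its variance should be roughly twice that of the real-data estimator. The function name `simulate` and all parameter values are hypothetical.

```python
import random
import statistics


def simulate(n=50, reps=4000, mu=0.0, sigma=1.0, seed=0):
    """Compare the variance of the sample mean on real vs. synthetic data.

    Assumed model (for illustration only): X_i ~ N(mu, sigma^2), sigma known.
    Synthetic data are drawn from the fitted model N(mu_hat, sigma^2).
    """
    rng = random.Random(seed)
    real_means, synth_means = [], []
    for _ in range(reps):
        # Real dataset and its maximum likelihood estimate of the mean.
        real = [rng.gauss(mu, sigma) for _ in range(n)]
        mu_hat = statistics.fmean(real)
        # Naive synthetic dataset: sample from the fitted model.
        synth = [rng.gauss(mu_hat, sigma) for _ in range(n)]
        real_means.append(mu_hat)
        synth_means.append(statistics.fmean(synth))
    return statistics.variance(real_means), statistics.variance(synth_means)


var_real, var_synth = simulate()
ratio = var_synth / var_real
print(f"variance of mean on real data:      {var_real:.5f}")
print(f"variance of mean on synthetic data: {var_synth:.5f}")
print(f"ratio (theory suggests about 2):    {ratio:.2f}")
```

In this toy setting the real-data mean has variance sigma^2/n, while the synthetic-data mean has variance 2*sigma^2/n, which is the kind of efficiency loss the abstract refers to.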
