Paper Title

GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using Macro Data Sources

Authors

Angeela Acharya, Siddhartha Sikdar, Sanmay Das, Huzefa Rangwala

Abstract

Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems. However, acquiring such data is not straightforward due to cost and privacy constraints, and access is often limited to aggregated data (macro data) sources. In this study, we examine synthetic data generation as a tool to extrapolate difficult-to-obtain high-resolution data by combining information from multiple easier-to-obtain lower-resolution data sources. In particular, we introduce a framework that combines univariate and multivariate frequency tables from a given target geographical location with frequency tables from other auxiliary locations to generate synthetic microdata for individuals in the target location. Our method combines the estimation of a dependency graph and conditional probabilities from the target location with the use of a Gaussian copula to leverage the available information from the auxiliary locations. We perform extensive testing on two real-world datasets and demonstrate that our approach outperforms prior approaches in preserving the overall dependency structure of the data while also satisfying the constraints defined on the different variables.
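
To make the Gaussian copula idea more concrete, the following is a minimal sketch of how categorical microdata could be sampled so that each variable matches a univariate frequency table while a latent correlation matrix induces dependence between variables. It is not the GenSyn implementation: the function name `sample_gaussian_copula`, the example counts, and the correlation matrix `corr` are all hypothetical, and the sketch assumes the correlation matrix is already given (in the setting described above it would presumably be informed by the auxiliary locations). The dependency graph and conditional probability stages are not covered.

```python
from statistics import NormalDist

import numpy as np


def sample_gaussian_copula(marginals, corr, n, seed=0):
    """Draw n synthetic records whose columns follow the given marginals.

    marginals: list of 1-D count arrays (univariate frequency tables).
    corr: latent correlation matrix (assumed to be given in this sketch).
    """
    rng = np.random.default_rng(seed)
    # Correlated standard normal draws, one column per variable.
    z = rng.multivariate_normal(np.zeros(len(marginals)), corr, size=n)
    std_normal = NormalDist()
    columns = []
    for j, counts in enumerate(marginals):
        probs = np.asarray(counts, dtype=float)
        probs = probs / probs.sum()
        # Cut points on the latent normal scale, one per category boundary.
        cum = np.clip(np.cumsum(probs)[:-1], 1e-12, 1 - 1e-12)
        cuts = np.array([std_normal.inv_cdf(p) for p in cum])
        # Category index = number of cut points below the latent value.
        columns.append(np.searchsorted(cuts, z[:, j]))
    return np.column_stack(columns)


# Hypothetical frequency tables: three age bands and two income bands.
age_counts = [300, 500, 200]
income_counts = [600, 400]
corr = np.array([[1.0, 0.6], [0.6, 1.0]])  # assumed latent correlation

synthetic = sample_gaussian_copula([age_counts, income_counts], corr, n=10_000)
age_share = np.bincount(synthetic[:, 0], minlength=3) / len(synthetic)
```

Each row of `synthetic` is one synthetic individual, encoded as category indices. Thresholding correlated normal draws at the quantiles of each marginal keeps the univariate distributions close to the input tables, while the off-diagonal entries of `corr` control how strongly the variables move together.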
