Paper Title

Private Synthetic Data with Hierarchical Structure

Authors

Terrance Liu, Zhiwei Steven Wu

Abstract

We study the problem of differentially private synthetic data generation for hierarchical datasets in which individual data points are grouped together (e.g., people within households). In particular, to measure the similarity between the synthetic dataset and the underlying private one, we frame our objective under the problem of private query release, generating a synthetic dataset that preserves answers for some collection of queries (i.e., statistics like mean aggregate counts). However, while the application of private synthetic data to the problem of query release has been well studied, such research is restricted to non-hierarchical data domains, raising the initial question -- what queries are important when considering data of this form? Moreover, it has not yet been established how one can generate synthetic data at both the group and individual-level while capturing such statistics. In light of these challenges, we first formalize the problem of hierarchical query release, in which the goal is to release a collection of statistics for some hierarchical dataset. Specifically, we provide a general set of statistical queries that captures relationships between attributes at both the group and individual-level. Subsequently, we introduce private synthetic data algorithms for hierarchical query release and evaluate them on hierarchical datasets derived from the American Community Survey and Allegheny Family Screening Tool data. Finally, we look to the American Community Survey, whose inherent hierarchical structure gives rise to another set of domain-specific queries that we run experiments with.
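To make the idea of a query that links group-level and individual-level attributes concrete, here is a minimal Python sketch. It is not the paper's algorithm: the data layout, the names `household_query` and `laplace_release`, and the choice of the Laplace mechanism are all assumptions made for illustration. It also assumes privacy is defined at the household level (neighboring datasets differ in one household) and that the number of households is public.

```python
import numpy as np

# Toy hierarchical dataset (hypothetical): each household has a group-level
# attribute and a list of members with individual-level attributes.
households = [
    {"owns_home": True, "members": [{"age": 34}, {"age": 5}]},
    {"owns_home": True, "members": [{"age": 70}]},
    {"owns_home": False, "members": [{"age": 8}, {"age": 40}]},
    {"owns_home": False, "members": [{"age": 29}]},
]

def household_query(groups, group_pred, indiv_pred):
    """Fraction of groups that satisfy group_pred and contain at least
    one member satisfying indiv_pred."""
    hits = sum(
        1
        for g in groups
        if group_pred(g) and any(indiv_pred(p) for p in g["members"])
    )
    return hits / len(groups)

def laplace_release(true_answer, n_groups, epsilon, rng):
    """Noisy answer via the Laplace mechanism (illustrative assumption).
    Replacing one household changes the fraction by at most 1 / n_groups,
    so that value is used as the sensitivity."""
    sensitivity = 1.0 / n_groups
    return true_answer + rng.laplace(scale=sensitivity / epsilon)

# Example query: households that own their home and include a child.
true_answer = household_query(
    households,
    group_pred=lambda g: g["owns_home"],
    indiv_pred=lambda p: p["age"] < 18,
)
rng = np.random.default_rng(0)
noisy_answer = laplace_release(true_answer, len(households), epsilon=1.0, rng=rng)
```

A synthetic dataset in the query-release setting would then be judged by how closely its answers to a collection of such queries match the private answers.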
