Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata

Paper Title

Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata

Authors

Claire Little, Mark Elliot, Richard Allmendinger

Abstract

Most statistical agencies release randomly selected samples of Census microdata, usually with sample fractions under 10% and with other forms of statistical disclosure control (SDC) applied. An alternative to SDC is data synthesis, which has been attracting growing interest, yet there is no clear consensus on how to measure the associated utility and disclosure risk of the data. The ability to produce synthetic Census microdata, where the utility and associated risks are clearly understood, could mean that more timely and wider-ranging access to microdata would be possible. This paper follows on from previous work by the authors which mapped synthetic Census data on a risk-utility (R-U) map. The paper presents a framework to measure the utility and disclosure risk of synthetic data by comparing it to samples of the original data of varying sample fractions, thereby identifying the sample fraction which has equivalent utility and risk to the synthetic data. Three commonly used data synthesis packages are compared with some interesting results. Further work is needed in several directions but the methodology looks very promising.
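
The idea of identifying the sample fraction with equivalent utility or risk can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the choice of linear interpolation, and the assumption that the score rises steadily with the sample fraction are all hypothetical, and the numbers in the usage lines are invented for illustration rather than taken from the paper.

```python
from bisect import bisect_left


def equivalent_sample_fraction(fractions, scores, synthetic_score):
    """Estimate the sample fraction whose score matches a synthetic dataset's score.

    fractions: sample fractions in increasing order (hypothetical inputs).
    scores: a utility or risk score measured on a random sample of the original
        data at each fraction; ASSUMED to be strictly increasing.
    synthetic_score: the same score measured on the synthetic data.
    """
    if len(fractions) != len(scores) or len(fractions) < 2:
        raise ValueError("need at least two (fraction, score) pairs of equal length")

    # Clamp to the measured range instead of extrapolating beyond it.
    if synthetic_score <= scores[0]:
        return fractions[0]
    if synthetic_score >= scores[-1]:
        return fractions[-1]

    # Find the first measured score that is >= the synthetic score,
    # then interpolate linearly between it and the previous point.
    i = bisect_left(scores, synthetic_score)
    f0, f1 = fractions[i - 1], fractions[i]
    s0, s1 = scores[i - 1], scores[i]
    return f0 + (f1 - f0) * (synthetic_score - s0) / (s1 - s0)


# Illustrative, made-up values (not results from the paper).
fractions = [0.01, 0.05, 0.10, 0.50]
utility = [0.2, 0.4, 0.5, 0.8]
print(equivalent_sample_fraction(fractions, utility, 0.45))
```

Applying the same function once to utility scores and once to risk scores gives two equivalent fractions for a synthetic dataset, which can then be compared against each other.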
