Synthetic Data -- Anonymisation Groundhog Day

Paper title

Synthetic Data -- Anonymisation Groundhog Day

Authors

Theresa Stadler, Bristena Oprisanu, Carmela Troncoso

Abstract

Synthetic data has been advertised as a silver-bullet solution to privacy-preserving data publishing that addresses the shortcomings of traditional anonymisation techniques. The promise is that synthetic data drawn from generative models preserves the statistical properties of the original dataset but, at the same time, provides perfect protection against privacy attacks. In this work, we present the first quantitative evaluation of the privacy gain of synthetic data publishing and compare it to that of previous anonymisation techniques. Our evaluation of a wide range of state-of-the-art generative models demonstrates that synthetic data either does not prevent inference attacks or does not retain data utility. In other words, we empirically show that synthetic data does not provide a better tradeoff between privacy and utility than traditional anonymisation techniques. Furthermore, in contrast to traditional anonymisation, the privacy-utility tradeoff of synthetic data publishing is hard to predict. Because it is impossible to predict what signals a synthetic dataset will preserve and what information will be lost, synthetic data leads to a highly variable privacy gain and unpredictable utility loss. In summary, we find that synthetic data is far from the holy grail of privacy-preserving data publishing.
