Paper Title

Towards a Data Privacy-Predictive Performance Trade-off

Authors

Carvalho, Tânia, Moniz, Nuno, Faria, Pedro, Antunes, Luís

Abstract

Machine learning is increasingly used in the most diverse applications and domains, whether in healthcare, to predict pathologies, or in the financial sector to detect fraud. One of the linchpins for efficiency and accuracy in machine learning is data utility. However, when it contains personal information, full access may be restricted due to laws and regulations aiming to protect individuals' privacy. Therefore, data owners must ensure that any data shared guarantees such privacy. Removal or transformation of private information (de-identification) are among the most common techniques. Intuitively, one can anticipate that reducing detail or distorting information would result in losses for model predictive performance. However, previous work concerning classification tasks using de-identified data generally demonstrates that predictive performance can be preserved in specific applications. In this paper, we aim to evaluate the existence of a trade-off between data privacy and predictive performance in classification tasks. We leverage a large set of privacy-preserving techniques and learning algorithms to provide an assessment of re-identification ability and the impact of transformed variants on predictive performance. Unlike previous literature, we confirm that the higher the level of privacy (lower re-identification risk), the higher the impact on predictive performance, pointing towards clear evidence of a trade-off.
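The abstract describes de-identification as the removal or transformation of private information, and measures both re-identification risk and the resulting impact on classification. As a minimal sketch of that idea, the toy example below (made-up records and a single generalization rule, not the authors' actual pipeline) shows how coarsening quasi-identifiers lowers the fraction of uniquely identifiable records:

```python
from collections import Counter

# Toy records: (age, zip_code, diagnosis). Age and zip code are the
# quasi-identifiers; diagnosis is the class label a model would predict.
records = [
    (34, "10115", "flu"),
    (35, "10117", "flu"),
    (47, "10245", "cold"),
    (49, "10247", "cold"),
]

QI = (0, 1)  # indices of the quasi-identifier columns

def reidentification_risk(rows, qi=QI):
    """Fraction of records whose quasi-identifier combination is unique."""
    counts = Counter(tuple(r[i] for i in qi) for r in rows)
    return sum(counts[tuple(r[i] for i in qi)] == 1 for r in rows) / len(rows)

def generalize(row):
    """Coarsen quasi-identifiers: age to a 10-year band, zip to a 3-digit prefix."""
    age, zip_code, label = row
    return (age // 10 * 10, zip_code[:3], label)

original_risk = reidentification_risk(records)         # every record is unique
transformed = [generalize(r) for r in records]
transformed_risk = reidentification_risk(transformed)  # records now blend together

print(original_risk, transformed_risk)  # 1.0 0.0
```

The same coarsening that makes records indistinguishable also erases detail a classifier could have used (exact age, full zip code), which is the predictive-performance side of the trade-off the paper quantifies.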
