Paper Title

The Cost of Privacy in Asynchronous Differentially-Private Machine Learning

Paper Authors

Farhad Farokhi, Nan Wu, David Smith, Mohamed Ali Kaafar

Paper Abstract

We consider training machine learning models using training data located on multiple private and geographically-scattered servers with different privacy settings. Due to the distributed nature of the data, communicating with all collaborating private data owners simultaneously may prove challenging or altogether impossible. In this paper, we develop differentially-private asynchronous algorithms for collaboratively training machine-learning models on multiple private datasets. The asynchronous nature of the algorithms implies that a central learner interacts with the private data owners one-on-one whenever they are available for communication, without needing to aggregate query responses to construct gradients of the entire fitness function. Therefore, the algorithm scales efficiently to many data owners. We define the cost of privacy as the difference between the fitness of a privacy-preserving machine-learning model and the fitness of a machine-learning model trained in the absence of privacy concerns. We prove that we can forecast the performance of the proposed privacy-preserving asynchronous algorithms. We demonstrate that the cost of privacy has an upper bound that is inversely proportional to the squared combined size of the training datasets and the squared sum of the privacy budgets. We validate the theoretical results with experiments on financial and medical datasets. The experiments illustrate that collaboration among more than 10 data owners, each with at least 10,000 records and privacy budgets greater than or equal to 1, results in a machine-learning model superior to one trained in isolation on only one of the datasets, illustrating the value of collaboration and the cost of privacy. The number of collaborating datasets can be lowered if the privacy budget is higher.
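The asynchronous scheme described above can be sketched as follows: whichever data owner happens to be available responds to the learner's query with a clipped, noise-perturbed gradient computed only on its own dataset, and the learner applies that single response immediately rather than aggregating across all owners. This is a minimal illustrative sketch, not the paper's exact mechanism; the Laplace noise scale, the clipping threshold, and the linear-regression fitness function are all assumptions made for the example.

```python
import numpy as np

def dp_gradient(theta, X, y, epsilon, clip=1.0, rng=None):
    """One data owner's differentially-private gradient response for a
    linear-regression fitness function (illustrative mechanism only)."""
    rng = rng or np.random.default_rng()
    n = len(y)
    grad = X.T @ (X @ theta - y) / n            # average squared-error gradient
    norm = np.linalg.norm(grad)
    if norm > clip:                             # clip to bound per-record sensitivity
        grad = grad * (clip / norm)
    scale = 2 * clip / (n * epsilon)            # assumed Laplace scale for budget epsilon
    return grad + rng.laplace(0.0, scale, size=grad.shape)

def async_dp_training(owners, dim, lr=0.1, rounds=200, rng=None):
    """Central learner: at each step a single available owner responds with a
    noisy local gradient; no aggregation across all owners is required."""
    rng = rng or np.random.default_rng(0)
    theta = np.zeros(dim)
    for _ in range(rounds):
        # model owner availability as a random choice among collaborators
        X, y, eps = owners[rng.integers(len(owners))]
        theta -= lr * dp_gradient(theta, X, y, eps, rng=rng)
    return theta
```

Because each update uses one owner's response, the learner never waits on the slowest collaborator, which is what lets the algorithm scale to many geographically-scattered data owners.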
