Paper Title
Understanding Unintended Memorization in Federated Learning
Paper Authors
Paper Abstract
Recent works have shown that generative sequence models (e.g., language models) have a tendency to memorize rare or unique sequences in the training data. Since useful models are often trained on sensitive data, to ensure the privacy of the training data it is critical to identify and mitigate such unintended memorization. Federated Learning (FL) has emerged as a novel framework for large-scale distributed learning tasks. However, it differs in many aspects from the well-studied central learning setting where all the data is stored at the central server. In this paper, we initiate a formal study to understand the effect of different components of canonical FL on unintended memorization in trained models, comparing with the central learning setting. Our results show that several differing components of FL play an important role in reducing unintended memorization. Specifically, we observe that the clustering of data according to users---which happens by design in FL---has a significant effect in reducing such memorization, and using the method of Federated Averaging for training causes a further reduction. We also show that training with a strong user-level differential privacy guarantee results in models that exhibit the least amount of unintended memorization.
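The abstract highlights Federated Averaging as one of the FL components that reduces memorization. The following is a minimal sketch of the Federated Averaging loop, not the paper's actual implementation: it assumes a simple linear model trained with local SGD under squared loss, and all function names, client data, and hyperparameters are illustrative.

```python
# Minimal sketch of Federated Averaging: each round, the server broadcasts
# the global model, each client trains locally on its own data, and the
# server averages the returned weights (weighted by local data size).
# Model, loss, and hyperparameters are illustrative assumptions.
import numpy as np

def local_sgd(weights, X, y, lr=0.1, epochs=5):
    """Run a few epochs of full-batch SGD on one client's data (squared loss)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

def federated_averaging(clients, rounds=20, dim=3):
    """clients: list of (X, y) pairs, one per user; data never leaves a client."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in clients:
            updates.append(local_sgd(global_w, X, y))  # local training only
            sizes.append(len(y))
        # Server step: weighted average of client models.
        global_w = np.average(updates, axis=0, weights=np.array(sizes, float))
    return global_w
```

Note that only model weights cross the network in this sketch; the per-user clustering of data that the abstract credits with reducing memorization is inherent in the `clients` structure, since each client computes updates solely from its own examples.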