Paper Title

Privacy Guarantees for De-identifying Text Transformations

Authors

Adelani, David Ifeoluwa, Davody, Ali, Kleinbauer, Thomas, Klakow, Dietrich

Abstract

Machine Learning approaches to Natural Language Processing tasks benefit from a comprehensive collection of real-life user data. At the same time, there is a clear need for protecting the privacy of the users whose data is collected and processed. For text collections, such as, e.g., transcripts of voice interactions or patient records, replacing sensitive parts with benign alternatives can provide de-identification. However, how much privacy is actually guaranteed by such text transformations, and are the resulting texts still useful for machine learning? In this paper, we derive formal privacy guarantees for general text transformation-based de-identification methods on the basis of Differential Privacy. We also measure the effect that different ways of masking private information in dialog transcripts have on a subsequent machine learning task. To this end, we formulate different masking strategies and compare their privacy-utility trade-offs. In particular, we compare a simple redact approach with more sophisticated word-by-word replacement using deep learning models on multiple natural language understanding tasks like named entity recognition, intent detection, and dialog act classification. We find that only word-by-word replacement is robust against performance drops in various tasks.
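The two masking strategies the abstract compares can be illustrated with a minimal sketch. Note that this is an assumption-laden toy example, not the paper's implementation: the token labels, the `SUBSTITUTES` table, and the function names are all invented for illustration.

```python
import random

# Toy illustration of the two de-identification strategies the paper compares.
# The categories, substitute lists, and example sentence are hypothetical.
SUBSTITUTES = {
    "PERSON": ["Alex", "Sam"],
    "CITY": ["Springfield", "Riverton"],
}

def redact(tokens, labels):
    """Simple redact approach: replace every sensitive token with one mask symbol."""
    return [tok if lab == "O" else "<MASK>" for tok, lab in zip(tokens, labels)]

def replace_word_by_word(tokens, labels, rng=None):
    """Word-by-word replacement: swap each sensitive token for a benign
    substitute of the same category, preserving sentence structure."""
    rng = rng or random.Random(0)
    return [tok if lab == "O" else rng.choice(SUBSTITUTES[lab])
            for tok, lab in zip(tokens, labels)]

tokens = ["Maria", "flew", "to", "Paris"]
labels = ["PERSON", "O", "O", "CITY"]

print(redact(tokens, labels))               # ['<MASK>', 'flew', 'to', '<MASK>']
print(replace_word_by_word(tokens, labels))  # e.g. a name and a city swapped in
```

The intuition behind the paper's finding is visible even here: redaction destroys the syntactic and semantic cues a downstream model relies on, while same-category replacement keeps the sentence well-formed, which helps explain why only word-by-word replacement avoids performance drops.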
