Paper title
Bootstrapping Text Anonymization Models with Distant Supervision
Authors
Abstract
We propose a novel method to bootstrap text anonymization models based on distant supervision. Instead of requiring manually labeled training data, the approach relies on a knowledge graph expressing the background information assumed to be publicly available about various individuals. This knowledge graph is employed to automatically annotate text documents including personal data about a subset of those individuals. More precisely, the method determines which text spans ought to be masked in order to guarantee $k$-anonymity, assuming an adversary with access to both the text documents and the background information expressed in the knowledge graph. The resulting collection of labeled documents is then used as training data to fine-tune a pre-trained language model for text anonymization. We illustrate this approach using a knowledge graph extracted from Wikidata and short biographical texts from Wikipedia. Evaluation results with a RoBERTa-based model and a manually annotated collection of 553 summaries showcase the potential of the approach, but also unveil a number of issues that may arise if the knowledge graph is noisy or incomplete. The results also illustrate that, contrary to most sequence labeling problems, the text anonymization task may admit several alternative solutions.
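The masking criterion described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm, which the abstract does not spell out. It assumes a toy knowledge graph stored as a mapping from individuals to sets of attribute values, text spans already linked to those values, and a simple greedy strategy. The names `consistent_individuals` and `spans_to_mask`, the identifiers `Q1` to `Q4`, and all data are hypothetical.

```python
from typing import Dict, List, Set


def consistent_individuals(kg: Dict[str, Set[str]], visible: Set[str]) -> Set[str]:
    """Individuals whose background knowledge contains every visible value."""
    return {person for person, facts in kg.items() if visible <= facts}


def spans_to_mask(kg: Dict[str, Set[str]], spans: List[str], k: int) -> List[str]:
    """Greedily mask spans until at least k individuals remain consistent.

    Assumption: each span is already linked to one attribute value in `kg`.
    The greedy choice is illustrative and not guaranteed to be minimal.
    """
    if len(kg) < k:
        raise ValueError("k exceeds the number of individuals in the graph")
    visible = list(dict.fromkeys(spans))  # drop duplicates, keep order
    masked: List[str] = []
    while len(consistent_individuals(kg, set(visible))) < k:
        # Mask the span whose removal enlarges the candidate set the most;
        # ties are broken by position in the text.
        best = max(
            visible,
            key=lambda s: len(consistent_individuals(kg, set(visible) - {s})),
        )
        visible.remove(best)
        masked.append(best)
    return masked


# Toy background knowledge about four hypothetical individuals.
kg = {
    "Q1": {"physicist", "Oslo", "1950"},
    "Q2": {"physicist", "Oslo", "1962"},
    "Q3": {"physicist", "Bergen", "1950"},
    "Q4": {"painter", "Oslo", "1950"},
}
spans = ["physicist", "Oslo", "1950"]  # spans found in a text about Q1

print(spans_to_mask(kg, spans, k=2))  # ['physicist']
print(spans_to_mask(kg, spans, k=3))  # ['physicist', 'Oslo']
```

In this toy example, masking "Oslo" or "1950" alone would also satisfy 2-anonymity, which mirrors the abstract's observation that the task may admit several alternative solutions. In the pipeline the abstract describes, masks derived this way would serve as labels for fine-tuning a sequence labeling model.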