A Fine-grained Chinese Software Privacy Policy Dataset for Sequence Labeling and Regulation Compliant Identification

Title

A Fine-grained Chinese Software Privacy Policy Dataset for Sequence Labeling and Regulation Compliant Identification

Authors

Kaifa Zhao, Le Yu, Shiyao Zhou, Jing Li, Xiapu Luo, Yat Fei Aemon Chiu, Yutong Liu

Abstract

Privacy protection has drawn considerable attention in both legislation and user awareness. To protect user privacy, countries have enacted laws and regulations that require privacy policies to regulate software behavior. However, privacy policies are written in natural language with many legal terms and much software jargon, which deters users from understanding or even reading them. It is therefore desirable to use NLP techniques to analyze privacy policies and help users understand them. Furthermore, existing datasets ignore legal requirements and are limited to English. In this paper, we construct the first Chinese privacy policy dataset, CA4P-483, to facilitate sequence labeling tasks and the identification of regulation compliance between privacy policies and software. Our dataset includes 483 Chinese Android application privacy policies, over 11K sentences, and 52K fine-grained annotations. We evaluate families of robust and representative baseline models on our dataset. Based on baseline performance, we report findings and potential research directions. Finally, we investigate potential applications of CA4P-483 that combine regulation requirements and program analysis.
