Paper title

"hasSignification()": a new distance function to support the detection of personal data

"hasSignification()": une nouvelle fonction de distance pour soutenir la détection de données personnelles

Authors

Mrabet, Amine; Hassan, Ali; Darmon, Patrice

Abstract

Today, with Big Data and data lakes, we face a mass of data that is very difficult to manage manually. Protecting personal data in this context requires automatic analysis for data discovery. Storing the names of already-analyzed attributes in a knowledge base could optimize this automatic discovery. To build a better knowledge base, we should not store any attribute whose name has no meaning. In this article, to check whether an attribute name has a meaning, we propose a solution that calculates the distances between this name and the words in a dictionary. Our studies of distance functions such as N-Gram, Jaro-Winkler, and Levenshtein show their limits when setting an acceptance threshold for an attribute in the knowledge base. To overcome these limitations, our solution aims to strengthen the score calculation by using an exponential function based on the longest sequence. In addition, a double scan of the dictionary is proposed to process attributes that have a compound name.
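
The abstract does not give the scoring formula, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes that "longest sequence" means the longest common contiguous substring, that the exponential score is a normalized `expm1` curve with a hypothetical steepness `k`, and that the "double scan" tries to split a compound name into two dictionary words. The names `score`, `has_meaning`, `k`, and `threshold` are all illustrative.

```python
import math


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def score(name: str, word: str, k: float = 3.0) -> float:
    """Assumed scoring: exponential in the matched fraction, scaled to [0, 1].

    An exact match scores 1.0, and partial matches are penalized more
    sharply than a linear ratio would penalize them.
    """
    longest = max(len(name), len(word))
    if longest == 0:
        return 0.0
    ratio = longest_common_substring(name, word) / longest
    return math.expm1(k * ratio) / math.expm1(k)


def best_score(name: str, dictionary: set[str]) -> float:
    """Highest score of name against any dictionary word."""
    return max((score(name, w) for w in dictionary), default=0.0)


def has_meaning(name: str, dictionary: set[str], threshold: float = 0.8) -> bool:
    """Hypothetical check: single scan first, then a second scan for compounds."""
    # Normalize: lowercase and drop common separators (an assumption).
    clean = name.lower().replace("_", "").replace("-", "")
    # First scan: the whole name against the dictionary.
    if best_score(clean, dictionary) >= threshold:
        return True
    # Second scan: try every split into two parts that both match.
    for i in range(1, len(clean)):
        left, right = clean[:i], clean[i:]
        if (best_score(left, dictionary) >= threshold
                and best_score(right, dictionary) >= threshold):
            return True
    return False
```

With a toy dictionary such as `{"first", "name", "birth", "date", "address"}`, this sketch accepts `name` on the first scan and `first_name` on the second, and rejects a random string such as `xqzt`. The threshold and steepness would need tuning against a real dictionary.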
