Paper Title

Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions

Authors

Anh Truong, Austin Walters, Jeremy Goodsitt

Abstract

Named Entity Recognition has been extensively investigated in many fields. However, the application of sensitive entity detection to production systems in financial institutions has not been well explored, owing to the lack of publicly available, labeled datasets. In this paper, we use internal and synthetic datasets to evaluate various methods of detecting NPI (Nonpublic Personally Identifiable) information commonly found within financial institutions, in both unstructured and structured data formats. Character-level neural network models, including CNN, LSTM, BiLSTM-CRF, and CNN-CRF, are investigated on two prediction tasks: (i) entity detection on multiple data formats, and (ii) column-wise entity prediction on tabular datasets. We compare these models with other standard approaches on both real and synthetic data, with respect to F1-score, precision, recall, and throughput. The real datasets include internal structured data and public email data with manually tagged labels. Our experimental results show that the CNN model is simple yet effective with respect to accuracy and throughput, and is thus the most suitable candidate model for deployment in production environments. Finally, we provide several lessons learned on data limitations, data labeling, and the intrinsic overlap of data entities.