Paper Title

Pseudonymization at Scale: OLCF's Summit Usage Data Case Study

Paper Authors

Ketan Maheshwari, Sean R. Wilkinson, Alex May, Tyler Skluzacek, Olga A. Kuchar, Rafael Ferreira da Silva

Paper Abstract

The analysis of vast amounts of data and the processing of complex computational jobs have traditionally relied upon high performance computing (HPC) systems. Understanding these analyses' needs is paramount for designing solutions that can lead to better science, and similarly, understanding the characteristics of user behavior on those systems is important for improving user experiences on HPC systems. A common approach to gathering data about user behavior is to analyze system log data available only to system administrators. Recently at the Oak Ridge Leadership Computing Facility (OLCF), however, we unveiled user behavior on the Summit supercomputer by collecting data from a user's point of view with ordinary Unix commands. Here, we discuss the process, challenges, and lessons learned while preparing this dataset for publication and submission to an open data challenge. The original dataset contains personally identifiable information (PII) about OLCF users which needed to be masked prior to publication, and we determined that anonymization, which scrubs PII completely, destroyed too much of the structure of the data to be interesting for the data challenge. We instead chose to pseudonymize the dataset to reduce its linkability to users' identities. Pseudonymization is significantly more computationally expensive than anonymization, and the size of our dataset, approximately 175 million lines of raw text, necessitated the development of a parallelized workflow that could be reused on different HPC machines. We demonstrate the scaling behavior of the workflow on two leadership-class HPC systems at OLCF, and we show that we were able to bring the overall makespan time from an impractical 20+ hours on a single node down to around 2 hours. As a result of this work, we release the entire pseudonymized dataset and make the workflows and source code publicly available.
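The abstract's key distinction is that pseudonymization, unlike anonymization, preserves the data's structure: the same user must map to the same stable pseudonym everywhere, so patterns of behavior survive even though identities do not. The paper does not specify its masking algorithm here, but a minimal sketch of this idea, assuming a keyed HMAC over each sensitive token (the key, prefix, and function names below are illustrative, not taken from the paper's workflow), could look like:

```python
import hashlib
import hmac

# Hypothetical secret key: kept private by the data publishers so that
# readers of the released dataset cannot reverse the mapping.
SECRET_KEY = b"replace-with-a-private-key"

def pseudonymize(value: str, prefix: str = "user") -> str:
    """Map a sensitive token to a stable pseudonym.

    The same input always yields the same pseudonym, so structure
    (e.g., which log lines belong to the same user) is preserved,
    while the original identity is hidden behind a keyed hash.
    """
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:8]}"

def pseudonymize_line(line: str, usernames: set[str]) -> str:
    """Replace every known username occurring in one raw text line."""
    for name in usernames:
        line = line.replace(name, pseudonymize(name))
    return line
```

Because each of the ~175 million lines can be processed independently once the set of known usernames is fixed, a function like `pseudonymize_line` is trivially parallelizable across chunks of the file, which is consistent with the abstract's report of scaling the workflow from 20+ hours on one node down to about 2 hours.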
