A Dataset and an Approach for Identity Resolution of 38 Million Author IDs Extracted from 2B Git Commits

Paper title

A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits

Authors

Tanner Fry, Tapajit Dey, Andrey Karnauch, Audris Mockus

Abstract

The data collected from open source projects provide means to model large software ecosystems, but often suffer from data quality issues, specifically, multiple author identification strings in code commits might actually be associated with one developer. While many methods have been proposed for addressing this problem, they are either heuristics requiring manual tweaking, or require too much calculation time to do pairwise comparisons for 38M author IDs in, for example, the World of Code collection. In this paper, we propose a method that finds all author IDs belonging to a single developer in this entire dataset, and share the list of all author IDs that were found to have aliases. To do this, we first create blocks of potentially connected author IDs and then use a machine learning model to predict which of these potentially related IDs belong to the same developer. We processed around 38 million author IDs and found around 14.8 million IDs to have an alias, which belong to 5.4 million different developers, with the median number of aliases being 2 per developer. This dataset can be used to create more accurate models of developer behaviour at the entire OSS ecosystem level and can be used to provide a service to rapidly resolve new author IDs.
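The two-stage idea in the abstract (block first, then classify pairs only within blocks) can be illustrated with a small sketch. This is not the authors' implementation. The blocking keys (name tokens and the email user part), the `same_developer` string-similarity rule standing in for the trained machine learning model, the 0.8 threshold, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def parse_author_id(author_id):
    """Split a 'Name <email>' string into a lowercased (name, email) pair."""
    name, _, rest = author_id.partition("<")
    email = rest.rstrip("> ").strip().lower()
    return name.strip().lower(), email


def blocking_keys(author_id):
    """Hypothetical blocking keys: each name token and the email user part."""
    name, email = parse_author_id(author_id)
    keys = {("token", token) for token in name.split()}
    user = email.split("@")[0]
    if user:
        keys.add(("user", user))
    return keys


def build_blocks(author_ids):
    """Group IDs sharing at least one key, so unrelated IDs are never compared."""
    blocks = defaultdict(set)
    for author_id in author_ids:
        for key in blocking_keys(author_id):
            blocks[key].add(author_id)
    return [ids for ids in blocks.values() if len(ids) > 1]


def similarity(x, y):
    """String similarity in [0, 1]; empty fields never count as a match."""
    if not x or not y:
        return 0.0
    return SequenceMatcher(None, x, y).ratio()


def same_developer(a, b, threshold=0.8):
    """Stand-in for the trained classifier (assumption: a simple threshold rule)."""
    name_a, email_a = parse_author_id(a)
    name_b, email_b = parse_author_id(b)
    name_sim = similarity(name_a, name_b)
    user_sim = similarity(email_a.split("@")[0], email_b.split("@")[0])
    return max(name_sim, user_sim) >= threshold


def resolve(author_ids):
    """Cluster aliases: compare pairs inside each block, merge with union-find."""
    parent = {a: a for a in author_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for block in build_blocks(author_ids):
        for a, b in combinations(sorted(block), 2):
            if same_developer(a, b):
                parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for a in author_ids:
        clusters[find(a)].add(a)
    # Only IDs that have at least one alias are reported.
    return [c for c in clusters.values() if len(c) > 1]
```

Calling `resolve` on a list of author strings returns sets of IDs judged to be aliases of one developer. Blocking is what keeps the number of comparisons far below the full pairwise count, which is the scaling concern the abstract raises for 38M IDs.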
