Paper Title
Representation of Developer Expertise in Open Source Software
Paper Authors
Paper Abstract
Background: Accurate representation of developer expertise has long been an important research problem. While a number of studies have proposed novel methods of representing expertise within individual projects, these methods are difficult to apply at an ecosystem level. However, as the focus of software development shifts from monolithic to modular, a method of representing developers' expertise in the context of OSS development as a whole becomes necessary, for example when a project tries to find new maintainers and looks for developers with relevant skills. Aim: We aim to address this knowledge gap by proposing and constructing the Skill Space, in which each API, developer, and project is represented, and by postulating how the topology of this space should reflect what developers know (and what projects need). Method: We use the World of Code infrastructure to extract the complete set of APIs in the files changed by open source developers and, based on that data, employ Doc2Vec embeddings to obtain vector representations of APIs, developers, and projects. We then evaluate whether these embeddings reflect the postulated topology of the Skill Space by predicting which new APIs developers use, which new projects they join, and whether or not their pull requests get accepted. We also check how the developers' representations in the Skill Space align with their self-reported API expertise. Result: Our results suggest that the proposed embeddings in the Skill Space appear to satisfy the postulated topology. We hope that such representations may aid in the construction of signals that increase trust in (and the efficiency of) open source ecosystems at large, and may aid investigations of other phenomena related to developer proficiency and learning.
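The idea that developers should lie close to the projects they are likely to join can be illustrated with a small sketch. This is not the authors' pipeline: it swaps Doc2Vec embeddings for plain API count vectors compared by cosine similarity, and every developer name, project name, and API list below is hypothetical toy data, not World of Code output.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: APIs found in the files each developer changed
# and in each project. Real data would come from World of Code.
developer_apis = {
    "dev_web": ["flask", "requests", "jinja2", "flask", "sqlalchemy"],
    "dev_ml": ["numpy", "pandas", "sklearn", "numpy", "matplotlib"],
}
project_apis = {
    "proj_api_server": ["flask", "sqlalchemy", "requests"],
    "proj_forecasting": ["pandas", "numpy", "sklearn"],
}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse API count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_projects(dev_api_list, projects):
    """Rank projects by similarity to a developer's API usage (assumed stand-in
    for distance in the Skill Space)."""
    dev_vec = Counter(dev_api_list)
    scored = [(name, cosine(dev_vec, Counter(apis))) for name, apis in projects.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking_web = rank_projects(developer_apis["dev_web"], project_apis)
ranking_ml = rank_projects(developer_apis["dev_ml"], project_apis)
```

In this toy setting, the top-ranked project for each developer is the one sharing the most APIs with them. A learned embedding such as Doc2Vec would additionally place related but non-identical APIs near each other, which count vectors cannot do.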