Paper title

Determining the Intrinsic Structure of Public Software Development History

Authors

Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli

Abstract

Background. Collaborative software development has produced a wealth of version control system (VCS) data that can now be analyzed in full. Little is known about the intrinsic structure of the entire corpus of publicly available VCS as an interconnected graph. Understanding its structure is needed to determine the best approach to analyze it in full and to avoid methodological pitfalls when doing so. Objective. We intend to determine the most salient network topology properties of public software development history as captured by VCS. We will explore: degree distributions, determining whether they are scale-free or not; distribution of connected component sizes; distribution of shortest path lengths. Method. We will use Software Heritage, which is the largest corpus of public VCS data, compress it using webgraph compression techniques, and analyze it in-memory using classic graph algorithms. Analyses will be performed both on the full graph and on relevant subgraphs. Limitations. The study is exploratory in nature; as such, no hypotheses on the findings are stated at this time. The chosen graph algorithms are expected to scale to the corpus size, but this will need to be confirmed experimentally. External validity will depend on how representative Software Heritage is of the software commons.
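
The three measurements named in the abstract (degree distributions, connected component sizes, shortest path lengths) can be illustrated on a toy graph. The sketch below is only an illustration in plain Python, not the authors' pipeline, which relies on webgraph compression to fit the corpus in memory. The graph `toy_graph`, its node names, and the helper functions are all hypothetical, and components are assumed here to be weakly connected (edge direction ignored).

```python
from collections import Counter, deque

# Hypothetical toy history graph: each node maps to the nodes it points to
# (for example, a revision pointing to its parent). Not real data.
toy_graph = {
    "rev3": ["rev2"],
    "rev2": ["rev1"],
    "rev1": [],
    "fork2": ["rev1"],
    "other1": [],
}


def degree_distributions(graph):
    """Return (in-degree histogram, out-degree histogram) as Counters."""
    out_hist = Counter(len(successors) for successors in graph.values())
    in_degree = Counter()
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1
    in_hist = Counter(in_degree.get(node, 0) for node in graph)
    return in_hist, out_hist


def component_sizes(graph):
    """Sizes of weakly connected components, largest first."""
    # Assumption: edge direction is ignored when grouping nodes.
    undirected = {node: set() for node in graph}
    for node, successors in graph.items():
        for succ in successors:
            undirected[node].add(succ)
            undirected.setdefault(succ, set()).add(node)
    seen, sizes = set(), []
    for start in undirected:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            current = queue.popleft()
            size += 1
            for neighbour in undirected[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def shortest_path_lengths(graph, source):
    """Breadth-first search distances from source, following edge direction."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for succ in graph.get(current, []):
            if succ not in distances:
                distances[succ] = distances[current] + 1
                queue.append(succ)
    return distances


in_hist, out_hist = degree_distributions(toy_graph)
sizes = component_sizes(toy_graph)
distances = shortest_path_lengths(toy_graph, "rev3")
```

Deciding whether a degree distribution is scale-free would additionally require fitting the histogram against a power law, which this sketch does not attempt.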
