Paper title
The OCEAN mailing list data set: Network analysis spanning mailing lists and code repositories
Paper authors
Paper abstract
Communication surrounding the development of an open source project largely occurs outside the software repository itself. Historically, large communities often used a collection of mailing lists to discuss the different aspects of their projects. Multimodal tool use, with software development and communication happening on different channels, complicates the study of open source projects as a sociotechnical system. Here, we combine and standardize mailing lists of the Python community, resulting in 954,287 messages from 1995 to the present. We share all scraping and cleaning code to facilitate reproduction of this work, as well as smaller datasets for the Golang (122,721 messages), Angular (20,041 messages) and Node.js (12,514 messages) communities. To showcase the usefulness of these data, we focus on the CPython repository and merge the technical layer (which GitHub account works on what file and with whom) with the social layer (messages from unique email addresses) by identifying 33% of GitHub contributors in the mailing list data. We then explore correlations between the valence of social messaging and the structure of the collaboration network. We discuss how these data provide a laboratory to test theories from standard organizational science in large open source projects.
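The merge of the technical and social layers described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the tuple schemas, the function names (`normalize`, `match_contributors`, `collaboration_edges`), the exact-match rule on lowercased email addresses, and the toy records are all assumptions made for illustration. The real data would likely need more careful identity resolution.

```python
from collections import Counter, defaultdict
from itertools import combinations


def normalize(email: str) -> str:
    """Lowercase and trim an address so trivially different spellings match."""
    return email.strip().lower()


def match_contributors(commits, messages):
    """Return {account: message count} for accounts found in the mailing lists.

    commits:  iterable of (account, email, path) tuples (hypothetical schema)
    messages: iterable of sender email strings (hypothetical schema)
    """
    message_counts = Counter(normalize(sender) for sender in messages)

    # An account may commit under several addresses; collect them all.
    emails_by_account = defaultdict(set)
    for account, email, _path in commits:
        emails_by_account[account].add(normalize(email))

    matched = {}
    for account, emails in emails_by_account.items():
        total = sum(message_counts.get(address, 0) for address in emails)
        if total:
            matched[account] = total
    return matched


def collaboration_edges(commits):
    """Link two accounts whenever they have touched the same file."""
    accounts_by_file = defaultdict(set)
    for account, _email, path in commits:
        accounts_by_file[path].add(account)

    edges = set()
    for accounts in accounts_by_file.values():
        edges.update(combinations(sorted(accounts), 2))
    return edges


# Toy records, invented for illustration only.
commits = [
    ("alice", "Alice@example.org", "Lib/os.py"),
    ("bob", "bob@example.org", "Lib/os.py"),
    ("carol", "carol@example.org", "Lib/re.py"),
]
messages = ["alice@example.org", "alice@example.org ", "dave@example.org"]

matched = match_contributors(commits, messages)
accounts = {account for account, _email, _path in commits}
share = len(matched) / len(accounts)  # fraction of accounts seen on the lists
edges = collaboration_edges(commits)
```

In this toy setup, `share` plays the role of the matched-contributor fraction reported in the abstract, and `edges` is a simple co-editing network onto which per-account message statistics could be attached.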