Paper title

20-MAD -- 20 Years of Issues and Commits of Mozilla and Apache Development

Authors

Maëlick Claes, Mika Mäntylä

Abstract

Data from long-lived, high-profile projects is valuable for research on successful software engineering in the wild. A dataset that links the different software repositories of such projects enables deeper investigations. This paper presents 20-MAD, a dataset linking the commit and issue data of Mozilla and Apache projects. It includes over 20 years of information about 765 projects, 3.4M commits, 2.3M issues, and 17.3M issue comments, and its compressed size is over 6 GB. The data contains all the typical information about source code commits (e.g., lines added and removed, message, and commit time) and issues (status, severity, votes, and summary). The issue comments have been pre-processed for natural language processing and sentiment analysis; this includes emoticons as well as valence and arousal scores. Linking code repository and issue tracker information allows studying individuals in both types of repositories and provides more accurate time zone information for issue trackers. To our knowledge, this is the largest linked dataset, in size and in project lifetime, that is not based on GitHub.
