Paper Title

Early Life Cycle Software Defect Prediction. Why? How?

Authors

Shrikanth, N. C.; Majumder, Suvodeep; Menzies, Tim

Abstract

Many researchers assume that, for software analytics, "more data is better." We write to show that, at least for learning defect predictors, this may not be true. To demonstrate this, we analyzed hundreds of popular GitHub projects. These projects ran for 84 months and contained 3,728 commits (median values). Across these projects, most of the defects occur very early in their life cycle. Hence, defect predictors learned from the first 150 commits and four months perform just as well as anything else. This means that, at least for the projects studied here, after the first few months, we need not continually update our defect prediction models. We hope these results inspire other researchers to adopt a "simplicity-first" approach to their work. Some domains require a complex and data-hungry analysis. But before assuming complexity, it is prudent to check the raw data looking for "short cuts" that can simplify the analysis.
