Test suite effectiveness metric evaluation: what do we know and what should we do?

Paper title

Test suite effectiveness metric evaluation: what do we know and what should we do?

Authors

Zhang, Peng; Wang, Yang; Liu, Xutong; Yang, Yibiao; Li, Yanhui; Chen, Lin; Wang, Ziyuan; Sun, Chang-ai; Zhou, Yuming

Abstract

Comparing test suite effectiveness metrics has long been an active research topic. However, prior studies that compare these metrics reach different, and sometimes contradictory, conclusions. The problem we find most troubling for our community is that researchers tend to oversimplify the description of the ground truth they use. For example, a common expression is that "we studied the correlation between real faults and the metric to evaluate (MTE)". However, the meaning of "real faults" is not clear-cut, so it needs to be scrutinized. Without this scrutiny, the conclusions are only half understood. To tackle this challenge, we propose a framework, ASSENT (evAluating teSt Suite EffectiveNess meTrics), to guide follow-up research. In essence, ASSENT consists of three fundamental components: ground truth, benchmark test suites, and an agreement indicator. First, materialize the ground truth for determining the real order in effectiveness among test suites. Second, generate a set of benchmark test suites and derive their ground truth order in effectiveness. Third, for the benchmark test suites, generate the MTE order in effectiveness using the MTE. Finally, calculate the agreement indicator between the two orders. Under ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative test suite effectiveness metrics, including mutation score metrics and code coverage metrics. Our results show that, based on real faults, mutation score and subsuming mutation score are the best metrics for quantifying test suite effectiveness. Meanwhile, when mutants are used instead of real faults, the values of MTEs are overestimated by more than 20%.
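
The final step of the procedure described in the abstract, computing an agreement indicator between the ground truth order and the MTE order, can be illustrated with a minimal sketch. The abstract does not say which agreement indicator ASSENT uses, so this example assumes Kendall's tau-a as a stand-in. The function name `kendall_tau_a`, the variable names, and all numbers are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau_a(ground_truth, metric_scores):
    """Kendall's tau-a between two score lists over the same test suites.

    Assumption: rank agreement is used here as the "agreement indicator";
    the abstract does not specify the indicator the authors actually use.
    """
    if len(ground_truth) != len(metric_scores):
        raise ValueError("both lists must describe the same test suites")
    n = len(ground_truth)
    if n < 2:
        raise ValueError("need at least two test suites to compare orders")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether both orders rank the pair
        # (i, j) the same way (positive) or in opposite ways (negative).
        sign = (ground_truth[i] - ground_truth[j]) * (
            metric_scores[i] - metric_scores[j]
        )
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Pairs tied in either list count as neither (tau-a convention).

    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical benchmark of four test suites (made-up numbers).
real_faults_detected = [1, 3, 4, 6]              # ground truth effectiveness
mutation_score = [0.40, 0.55, 0.50, 0.80]        # one metric to evaluate
statement_coverage = [0.90, 0.60, 0.70, 0.65]    # another metric to evaluate

tau_mutation = kendall_tau_a(real_faults_detected, mutation_score)
tau_coverage = kendall_tau_a(real_faults_detected, statement_coverage)
print(f"mutation score agreement:     {tau_mutation:.3f}")
print(f"statement coverage agreement: {tau_coverage:.3f}")
```

In this made-up data the mutation score order agrees with the ground truth order more closely than the coverage order does. A higher value means the metric ranks the benchmark test suites more like the ground truth, which is the sense in which metrics can be compared under such a framework.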
