The Impact of Using Regression Models to Build Defect Classifiers

Paper Title

The Impact of Using Regression Models to Build Defect Classifiers

Authors

Rajbahadur, Gopi Krishnan; Wang, Shaowei; Kamei, Yasutaka; Hassan, Ahmed E.

Abstract

It is common practice to discretize continuous defect counts into defective and non-defective classes and use them as the target variable when building defect classifiers (discretized classifiers). However, this discretization of continuous defect counts leads to information loss that might affect the performance and interpretation of defect classifiers. Another possible approach to building defect classifiers is to train regression models on the defect counts and then discretize the predicted counts into defective and non-defective classes (regression-based classifiers). In this paper, we compare the performance and interpretation of defect classifiers that are built using both approaches (i.e., discretized classifiers and regression-based classifiers) across six commonly used machine learning classifiers (i.e., linear/logistic regression, random forest, KNN, SVM, CART, and neural networks) and 17 datasets. We find that: i) random forest based classifiers outperform other classifiers (best AUC) for both classifier building approaches; ii) in contrast to common practice, building a defect classifier using discretized defect counts (i.e., discretized classifiers) does not always lead to better performance. Hence, we suggest that future defect classification studies should consider building regression-based classifiers (in particular when the defective ratio of the modeled dataset is low). Moreover, we suggest that both approaches for building defect classifiers should be explored, so that the best-performing classifier can be used when determining the most influential features.
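The difference between the two building approaches may be easier to see in code. The following is a minimal sketch, not the authors' experimental setup: it uses a tiny pure-Python k-nearest-neighbours predictor (KNN is one of the six learners named in the abstract) on synthetic data. The helper names (`knn_predict`, `auc`, `make_module`), the toy metrics `size` and `churn`, and the 0.5 cut-off for turning predicted counts into classes are all assumptions made for illustration.

```python
import random

def knn_predict(train_X, train_y, x, k=5):
    """Mean target of the k nearest training rows (squared Euclidean distance)."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)

def auc(labels, scores):
    """AUC: chance that a defective module is scored above a clean one (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_module(rng):
    """Hypothetical toy module: two metrics and a noisy, non-negative defect count."""
    size = rng.uniform(0, 10)
    churn = rng.uniform(0, 10)
    count = max(0, round(0.4 * size + 0.2 * churn - 3 + rng.gauss(0, 1)))
    return (size, churn), count

rng = random.Random(0)
train_X, train_counts = zip(*[make_module(rng) for _ in range(200)])
test_X, test_counts = zip(*[make_module(rng) for _ in range(100)])

# Ground truth for evaluation: a module is defective if it has at least one defect.
train_labels = [int(c > 0) for c in train_counts]
test_labels = [int(c > 0) for c in test_counts]

# Approach 1, discretized classifier: discretize the counts first, then train on 0/1 labels.
# The mean label of the neighbours acts as the predicted probability of being defective.
disc_scores = [knn_predict(train_X, train_labels, x) for x in test_X]

# Approach 2, regression-based classifier: train on the raw counts, discretize afterwards.
reg_counts = [knn_predict(train_X, train_counts, x) for x in test_X]
reg_labels = [int(c >= 0.5) for c in reg_counts]  # assumed cut-off, not from the paper

# AUC is threshold-independent, so the predicted counts can serve directly as scores.
auc_discretized = auc(test_labels, disc_scores)
auc_regression = auc(test_labels, reg_counts)
print(f"discretized AUC: {auc_discretized:.3f}, regression-based AUC: {auc_regression:.3f}")
```

Both pipelines are evaluated against the same discretized ground truth; only the point at which discretization happens differs. Which one scores higher on this toy data says nothing about real projects, which is why the abstract recommends trying both.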
