Paper Title
Optimal Decision Trees for Nonlinear Metrics
Paper Authors
Paper Abstract
Nonlinear metrics, such as the F1-score, Matthews correlation coefficient, and Fowlkes-Mallows index, are often used to evaluate the performance of machine learning models, in particular when facing imbalanced datasets that contain more samples of one class than the other. Recent optimal decision tree algorithms have shown remarkable progress in producing trees that are optimal with respect to linear criteria, such as accuracy, but unfortunately nonlinear metrics remain a challenge. To address this gap, we propose a novel algorithm based on bi-objective optimisation, which treats misclassifications of each binary class as a separate objective. We show that, for a large class of metrics, the optimal tree lies on the Pareto frontier. Consequently, we obtain the optimal tree by using our method to generate the set of all nondominated trees. To the best of our knowledge, this is the first method to compute provably optimal decision trees for nonlinear metrics. Our approach leads to a trade-off when compared to optimising linear metrics: the resulting trees may be more desirable according to the given nonlinear metric, at the expense of higher runtimes. Nevertheless, the experiments illustrate that runtimes are reasonable for the majority of the tested datasets.
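The selection step described in the abstract (picking the metric-optimal tree from the set of nondominated trees) can be illustrated with a minimal sketch. The sketch below is an assumption-laden illustration, not the paper's implementation: it assumes each candidate tree is summarised by its (false positive, false negative) counts, and the helper names `f1_from_counts` and `best_tree_by_metric` are hypothetical.

```python
# Minimal sketch: choosing the metric-optimal tree from a Pareto frontier of
# (FP, FN) trade-offs. Names and the frontier data below are illustrative only.

from typing import List, Tuple

def f1_from_counts(fp: int, fn: int, pos: int) -> float:
    """F1-score expressed via false positives/negatives and the number of
    positive samples: F1 = 2*TP / (2*TP + FP + FN), where TP = pos - FN."""
    tp = pos - fn
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

def best_tree_by_metric(frontier: List[Tuple[int, int]], pos: int) -> Tuple[int, int]:
    """Given nondominated (FP, FN) pairs (one per tree on the Pareto frontier),
    return the pair maximising the nonlinear metric (F1 in this sketch)."""
    return max(frontier, key=lambda counts: f1_from_counts(counts[0], counts[1], pos))

# Toy usage: three nondominated trade-offs on a dataset with 30 positive samples.
frontier = [(0, 12), (3, 6), (10, 2)]
print(best_tree_by_metric(frontier, pos=30))  # -> (3, 6) under F1
```

Because metrics such as F1 improve monotonically as either error count decreases (holding the other fixed), any tree not on the frontier is dominated by one that scores at least as well, which is why scanning only the nondominated set suffices.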