Paper Title

Analysis and Comparison of Classification Metrics

Paper Author

Ferrer, Luciana

Abstract

A variety of different performance metrics are commonly used in the machine learning literature for the evaluation of classification systems. Some of the most common ones for measuring quality of hard decisions are standard and balanced accuracy, standard and balanced error rate, F-beta score, and Matthews correlation coefficient (MCC). In this document, we review the definition of these and other metrics and compare them with the expected cost (EC), a metric introduced in every statistical learning course but rarely used in the machine learning literature. We show that both the standard and balanced error rates are special cases of the EC. Further, we show its relation with F-beta score and MCC and argue that EC is superior to these traditional metrics for being based on first principles from statistics, and for being more general, interpretable, and adaptable to any application scenario. The metrics mentioned above measure the quality of hard decisions. Yet, most modern classification systems output continuous scores for the classes which we may want to evaluate directly. Metrics for measuring the quality of system scores include the area under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC or Bayes risk, among others. The last three metrics are special cases of a family of metrics given by the expected value of proper scoring rules (PSRs). We review the theory behind these metrics, showing that they are a principled way to measure the quality of the posterior probabilities produced by a system. Finally, we show how to use these metrics to compute a system's calibration loss and compare this metric with the widely-used expected calibration error (ECE), arguing that calibration loss based on PSRs is superior to the ECE for being more interpretable, more general, and directly applicable to the multi-class case, among other reasons.
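To make the special-case claim concrete, here is a minimal sketch (not from the paper; the function name, the toy confusion matrix, and the default conventions are illustrative assumptions). It computes the expected cost EC = Σ_i π_i Σ_j c_ij P(decide j | class i) from a confusion matrix and checks that 0-1 costs with empirical priors recover the standard error rate, while 0-1 costs with uniform priors recover the balanced error rate.

```python
import numpy as np

def expected_cost(confusion, priors=None, costs=None):
    """Expected cost (EC) estimated from a confusion matrix.

    confusion[i, j]: count of samples with true class i decided as class j.
    priors[i]:       prior probability of class i (default: empirical frequencies).
    costs[i, j]:     cost of deciding class j when the true class is i
                     (default: 0-1 costs, i.e., 0 on the diagonal, 1 elsewhere).
    """
    confusion = np.asarray(confusion, dtype=float)
    K = confusion.shape[0]
    row_totals = confusion.sum(axis=1, keepdims=True)
    # P(decide j | true class i), one row per true class
    p_decision_given_class = confusion / row_totals
    if priors is None:
        priors = row_totals.ravel() / confusion.sum()  # empirical priors
    priors = np.asarray(priors, dtype=float)
    if costs is None:
        costs = 1.0 - np.eye(K)  # 0-1 costs
    # EC = sum_i priors[i] * sum_j costs[i, j] * P(decide j | class i)
    return float(np.sum(priors[:, None] * costs * p_decision_given_class))

# Imbalanced toy example: 180 samples of true class 0, 20 of true class 1.
cm = np.array([[160, 20],
               [5,   15]])

print(expected_cost(cm))                     # 0.125  == standard error rate
print(1.0 - np.trace(cm) / cm.sum())         # 0.125

print(expected_cost(cm, priors=[0.5, 0.5]))  # 0.1805... == balanced error rate
print(0.5 * (20 / 180 + 5 / 20))             # 0.1805...
```

Changing `costs` away from the 0-1 default (for example, penalizing a missed detection more heavily than a false alarm) adapts the same metric to a specific application, which is the generality and adaptability the abstract argues for over F-beta and MCC.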
