Paper Title
Decoding machine learning benchmarks
Paper Authors
Paper Abstract
Despite the availability of benchmark machine learning (ML) repositories (e.g., UCI, OpenML), there is still no standard evaluation strategy capable of pointing out which set of datasets should serve as a gold standard for testing different ML algorithms. In recent studies, Item Response Theory (IRT) has emerged as a new approach to elucidate what a good ML benchmark should be. This work applied IRT to explore the well-known OpenML-CC18 benchmark and to identify how suitable it is for the evaluation of classifiers. Several classifiers, ranging from classical to ensemble ones, were evaluated using IRT models, which simultaneously estimate dataset difficulty and classifier ability. The Glicko-2 rating system was applied on top of IRT to summarize the innate ability and aptitude of the classifiers. It was observed that not all datasets from OpenML-CC18 are really useful for evaluating classifiers. Most datasets evaluated in this work (84%) contain mostly easy instances (e.g., only around 10% of difficult instances). Also, in half of this benchmark, 80% of the instances are very discriminating ones, which can be of great use for pairwise algorithm comparison but are not useful for pushing classifiers' abilities. This paper presents this new IRT-based evaluation methodology as well as the tool decodIRT, developed to guide IRT estimation over ML benchmarks.
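To make the abstract's notions of instance difficulty, discrimination, and classifier ability concrete, the sketch below illustrates the standard 3-parameter logistic (3PL) IRT response function, a common choice for this kind of analysis. It is a generic textbook illustration, not code from the paper or from decodIRT, and the exact IRT variant and parameter values used in the study are assumptions here.

```python
import math

def irt_3pl_probability(theta, a, b, c):
    """Probability that a respondent (here, a classifier) with ability `theta`
    answers an item (here, a test instance) correctly under the standard
    3-parameter logistic (3PL) IRT model.

    a: item discrimination, b: item difficulty, c: pseudo-guessing parameter.
    Generic textbook formula; not necessarily the exact model configuration
    used by the paper or by decodIRT.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Illustrative, hypothetical values (not taken from the paper):
# a strong classifier (theta = 2.0) on an easy, highly discriminating instance
print(irt_3pl_probability(theta=2.0, a=1.5, b=-1.0, c=0.2))  # probability close to 1
# the same classifier on a very hard instance
print(irt_3pl_probability(theta=2.0, a=1.5, b=4.0, c=0.2))   # probability near the guessing floor c
```

In this formulation, an instance with high discrimination `a` sharply separates classifiers whose abilities lie on either side of its difficulty `b`, which is why highly discriminating instances help pairwise comparison, while only instances with high difficulty actually challenge (push) strong classifiers.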