Paper Title
Modeling Generalization in Machine Learning: A Methodological and Computational Study
Paper Authors
Paper Abstract
As machine learning becomes more and more available to the general public, theoretical questions are turning into pressing practical issues. Possibly, one of the most relevant concerns is the assessment of our confidence in trusting machine learning predictions. In many real-world cases, it is of utmost importance to estimate the capabilities of a machine learning algorithm to generalize, i.e., to provide accurate predictions on unseen data, depending on the characteristics of the target problem. In this work, we perform a meta-analysis of 109 publicly-available classification data sets, modeling machine learning generalization as a function of a variety of data set characteristics, ranging from number of samples to intrinsic dimensionality, from class-wise feature skewness to $F1$ evaluated on test samples falling outside the convex hull of the training set. Experimental results demonstrate the relevance of using the concept of the convex hull of the training data in assessing machine learning generalization, by emphasizing the difference between interpolated and extrapolated predictions. Besides several predictable correlations, we observe unexpectedly weak associations between the generalization ability of machine learning models and all metrics related to dimensionality, thus challenging the common assumption that the \textit{curse of dimensionality} might impair generalization in machine learning.
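The abstract's key methodological ingredient is labeling each test sample as "interpolated" (inside the convex hull of the training set) or "extrapolated" (outside it) and scoring the two groups separately, e.g., with $F1$. The snippet below is a minimal sketch of one standard way to perform that membership test, not the authors' actual code: a point x lies in the convex hull of the training rows iff there exist non-negative weights summing to one that reproduce x, which can be checked as a linear-programming feasibility problem. The function name in_convex_hull and the usage variables (X_train, X_test, y_test, y_pred) are hypothetical.

import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, X_train):
    """Return True if point x lies inside the convex hull of the rows of X_train.

    Feasibility check: find lambda >= 0 with sum(lambda) = 1 and
    lambda^T X_train = x. Any objective works; only feasibility matters.
    """
    n = X_train.shape[0]
    # Equality constraints stack the coordinate equations and the sum-to-one constraint.
    A_eq = np.vstack([X_train.T, np.ones((1, n))])
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

# Hypothetical usage: split test predictions into interpolated vs. extrapolated
# groups, then evaluate each group separately (e.g., with sklearn's f1_score):
#   mask   = np.array([in_convex_hull(x, X_train) for x in X_test])
#   f1_in  = f1_score(y_test[mask],  y_pred[mask],  average="macro")
#   f1_out = f1_score(y_test[~mask], y_pred[~mask], average="macro")

Note that an exact LP-based test like this scales with the number of training points and features; in high-dimensional settings, approximate or projected variants are often used instead, but the interpolation/extrapolation distinction is the same.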