Paper Title

Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles

Paper Authors

Luqin Gan, Lili Zheng, Genevera I. Allen

Paper Abstract

To promote new scientific discoveries from complex data sets, feature importance inference has been a long-standing statistical problem. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model refitting and data splitting, which are computationally intensive and lead to losses in power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast as we avoid model refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied on tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model under mild assumptions. Finally, our same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. We validate our intervals on a series of synthetic and real data examples, including non-linear settings, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.
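To make the idea in the abstract concrete, below is a minimal Python sketch of minipatch-ensemble LOCO feature importance: many base models are fit on random observation-and-feature subsamples ("minipatches"), each observation's ensemble prediction is formed only from minipatches that never trained on it, and a feature's importance is the increase in prediction error when minipatches containing that feature are additionally excluded. The function name minipatch_loco, the Ridge base learner, the default subsample sizes, and the simple 1.96-standard-error intervals are all illustrative assumptions here, not the authors' exact estimator or inference procedure.

```python
# Minimal sketch of minipatch-ensemble LOCO feature importance (an illustrative
# re-implementation, NOT the paper's exact estimator or inference procedure).
import numpy as np
from sklearn.linear_model import Ridge


def minipatch_loco(X, y, n_mp=500, n_obs=None, n_feat=None, base=Ridge, seed=0):
    """Fit many models on random observation/feature subsamples ("minipatches"),
    then score each feature by how much excluding it degrades the
    leave-one-observation-out ensemble prediction."""
    N, M = X.shape
    n_obs = n_obs or max(2, int(np.sqrt(N)))    # minipatch sizes: illustrative defaults
    n_feat = n_feat or max(1, int(np.sqrt(M)))
    rng = np.random.default_rng(seed)

    preds = np.empty((n_mp, N))                 # each minipatch model's predictions on all rows
    obs_in = np.zeros((n_mp, N), dtype=bool)    # which observations each minipatch trained on
    feat_in = np.zeros((n_mp, M), dtype=bool)   # which features each minipatch used
    for k in range(n_mp):
        I = rng.choice(N, size=n_obs, replace=False)
        F = rng.choice(M, size=n_feat, replace=False)
        obs_in[k, I] = True
        feat_in[k, F] = True
        model = base().fit(X[np.ix_(I, F)], y[I])
        preds[k] = model.predict(X[:, F])

    # Per-observation LOCO score: error of the ensemble that never used feature j
    # minus error of the full ensemble, using only minipatches not trained on row i.
    loco = np.zeros((N, M))
    for i in range(N):
        loo = ~obs_in[:, i]                     # minipatches that did not train on observation i
        err_full = abs(y[i] - preds[loo, i].mean())
        for j in range(M):
            loo_noj = loo & ~feat_in[:, j]      # ... and that also never used feature j
            loco[i, j] = abs(y[i] - preds[loo_noj, i].mean()) - err_full

    # Simple normal-approximation intervals over observations (an assumption in this
    # sketch; the paper establishes asymptotic coverage under its own conditions).
    est = loco.mean(axis=0)
    se = loco.std(axis=0, ddof=1) / np.sqrt(N)
    return est, est - 1.96 * se, est + 1.96 * se


# Illustrative usage on synthetic data where only the first three features matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.normal(size=300)
est, lower, upper = minipatch_loco(X, y)
print(np.round(est, 2))
print((lower > 0).nonzero()[0])                 # features whose interval excludes zero
```

Because no model is refit when an observation or feature is held out, the importance scores and intervals come from a single pass of minipatch training; as the abstract notes, the same out-of-patch predictions can also be reused to form prediction intervals.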
