Title
Multi-Model Subset Selection
Authors
Abstract
The two primary approaches for high-dimensional regression problems are sparse methods (e.g., best subset selection, which uses the L0-norm in the penalty) and ensemble methods (e.g., random forests). Although sparse methods typically yield interpretable models, in terms of prediction accuracy they are often outperformed by "black-box" multi-model ensemble methods. A regression ensemble is introduced which combines the interpretability of sparse methods with the high prediction accuracy of ensemble methods. An algorithm is proposed to solve the joint optimization of the corresponding L0-penalized regression models by extending recent developments in L0-optimization for sparse methods to multi-model regression ensembles. The sparse and diverse models in the ensemble are learned simultaneously from the data. Each of these models provides an explanation for the relationship between a subset of the predictors and the response variable. Empirical studies and theoretical results on ensembles are used to gain insight into the ensemble method's performance, focusing on the interplay between bias, variance, covariance, and variable selection. In prediction tasks, the ensembles can outperform state-of-the-art competitors on both simulated and real data. Forward stepwise regression is also generalized to multi-model regression ensembles and used to obtain an initial solution for the algorithm. The optimization algorithms are implemented in publicly available software packages.
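
To make the idea concrete, below is a minimal Python sketch of the general scheme the abstract describes: several sparse linear models are fitted jointly, a penalty discourages the models from reusing each other's predictors, and the ensemble prediction averages the individual models. Everything specific here is an illustrative assumption, not the authors' algorithm or the interface of their published packages: the function names, the particular diversity penalty (a per-variable charge proportional to how many other models already use that variable), and the block-coordinate update scheme that refits one model at a time by penalized forward stepwise selection.

import numpy as np

def ls_fit(X, y, active):
    """Least-squares fit restricted to the active set; returns (beta, RSS)."""
    beta = np.zeros(X.shape[1])
    if active:
        coef, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        beta[active] = coef
    resid = y - X @ beta
    return beta, float(resid @ resid)

def forward_stepwise(X, y, t, usage, lam):
    """Greedy forward selection of at most t predictors for one model.

    The score for adding variable j is the RSS reduction minus
    lam * usage[j], where usage[j] counts how many of the other models
    already use j (an assumed, illustrative form of diversity penalty).
    """
    active = []
    for _ in range(t):
        _, rss_cur = ls_fit(X, y, active)
        best_j, best_score = None, 0.0
        for j in range(X.shape[1]):
            if j in active:
                continue
            _, rss_new = ls_fit(X, y, active + [j])
            score = (rss_cur - rss_new) - lam * usage[j]
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:  # no variable improves the penalized fit: stop early
            break
        active.append(best_j)
    return active

def multi_model_ensemble(X, y, n_models=3, t=5, lam=5.0, n_passes=3):
    """Toy joint fit: cycle over the models, refitting each one by penalized
    forward stepwise while the other models' selected variables stay fixed."""
    p = X.shape[1]
    supports = [[] for _ in range(n_models)]
    for _ in range(n_passes):
        for g in range(n_models):
            usage = np.zeros(p)
            for h in range(n_models):
                if h != g:
                    for j in supports[h]:
                        usage[j] += 1.0
            supports[g] = forward_stepwise(X, y, t, usage, lam)
    betas = [ls_fit(X, y, sup)[0] for sup in supports]
    return betas, supports

def predict(X, betas):
    """Ensemble prediction: average the individual models' predictions."""
    return np.mean([X @ b for b in betas], axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 200, 40
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:6] = 1.0
    y = X @ beta_true + rng.standard_normal(n)
    betas, supports = multi_model_ensemble(X, y)
    print("selected supports per model:", supports)
    print("in-sample RMSE:", np.sqrt(np.mean((y - predict(X, betas)) ** 2)))

The averaging step is where the abstract's bias-variance-covariance interplay shows up in this sketch: each small model is individually more biased than one large model, but because the diversity penalty pushes the models toward partly disjoint predictor subsets, their errors are less correlated, so the averaged prediction can have lower variance and lower between-model covariance.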