Paper title
Reproducible Model Selection Using Bagged Posteriors
Paper authors
Paper abstract
Bayesian model selection is premised on the assumption that the data are generated from one of the postulated models. However, in many applications, all of these models are incorrect (that is, there is misspecification). When the models are misspecified, two or more models can provide a nearly equally good fit to the data, in which case Bayesian model selection can be highly unstable, potentially leading to self-contradictory findings. To remedy this instability, we propose to use bagging on the posterior distribution ("BayesBag") -- that is, to average the posterior model probabilities over many bootstrapped datasets. We provide theoretical results characterizing the asymptotic behavior of the posterior and the bagged posterior in the (misspecified) model selection setting. We empirically assess the BayesBag approach on synthetic and real-world data in (i) feature selection for linear regression and (ii) phylogenetic tree reconstruction. Our theory and experiments show that, when all models are misspecified, BayesBag (a) provides greater reproducibility and (b) places posterior mass on optimal models more reliably, compared to the usual Bayesian posterior; on the other hand, under correct specification, BayesBag is slightly more conservative than the usual posterior, in the sense that BayesBag posterior probabilities tend to be slightly farther from the extremes of zero and one. Overall, our results demonstrate that BayesBag provides an easy-to-use and widely applicable approach that improves upon Bayesian model selection by making it more stable and reproducible.
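The averaging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy feature-selection setting for linear regression with a conjugate Gaussian prior on the coefficients, a known noise variance, a uniform prior over models, and bootstrap datasets of the same size as the original data. All function names and the parameters `sigma2`, `tau2`, and `n_boot` are hypothetical choices made for illustration.

```python
import numpy as np

def log_marginal(X, y, sigma2=1.0, tau2=1.0):
    """Log marginal likelihood of y when y ~ N(X b, sigma2 I) and b ~ N(0, tau2 I).

    Integrating out b gives y ~ N(0, sigma2 I + tau2 X X^T).
    (Assumed toy model, not taken from the paper.)
    """
    n = len(y)
    cov = sigma2 * np.eye(n) + tau2 * (X @ X.T)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def posterior_model_probs(X, y, models):
    """Usual posterior probabilities over feature subsets (uniform model prior)."""
    logs = np.array([log_marginal(X[:, list(m)], y) for m in models])
    w = np.exp(logs - logs.max())  # subtract the max for numerical stability
    return w / w.sum()

def bayesbag_model_probs(X, y, models, n_boot=50, seed=0):
    """Bagged posterior: average the model probabilities over bootstrapped datasets."""
    rng = np.random.default_rng(seed)
    n = len(y)
    total = np.zeros(len(models))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        total += posterior_model_probs(X[idx], y[idx], models)
    return total / n_boot

# Synthetic data in which every candidate linear model is misspecified
# (the quadratic term cannot be represented by any feature subset).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X[:, 0] + 0.1 * X[:, 1] ** 2 + rng.normal(size=40)

models = [(), (0,), (1,), (0, 1)]  # candidate feature subsets
post = posterior_model_probs(X, y, models)
bagged = bayesbag_model_probs(X, y, models)
print("usual posterior: ", np.round(post, 3))
print("bagged posterior:", np.round(bagged, 3))
```

Because the bagged probabilities are an average of ordinary posteriors, the only extra cost is re-running the posterior computation once per bootstrap dataset, so the loop can be parallelized trivially.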