Paper Title

Feature Selection for Huge Data via Minipatch Learning

Authors

Tianyi Yao, Genevera I. Allen

Abstract

Feature selection often leads to increased model interpretability, faster computation, and improved model performance by discarding irrelevant or redundant features. While feature selection is a well-studied problem with many widely-used techniques, there are typically two key challenges: i) many existing approaches become computationally intractable in huge-data settings with millions of observations and features; and ii) the statistical accuracy of selected features degrades in high-noise, high-correlation settings, thus hindering reliable model interpretation. We tackle these problems by proposing Stable Minipatch Selection (STAMPS) and Adaptive STAMPS (AdaSTAMPS). These are meta-algorithms that build ensembles of selection events of base feature selectors trained on many tiny (adaptively chosen) random subsets of both the observations and features of the data, which we call minipatches. Our approaches are general and can be employed with a variety of existing feature selection strategies and machine learning techniques. In addition, we provide theoretical insights on STAMPS and empirically demonstrate that our approaches, especially AdaSTAMPS, dominate competing methods in terms of feature selection accuracy and computational time.
