Paper Title


SubStrat: A Subset-Based Strategy for Faster AutoML

Paper Authors

Teddy Lazebnik, Amit Somech, Abraham Itzhak Weinberg

Paper Abstract


Automated machine learning (AutoML) frameworks have become important tools in the data scientist's arsenal, as they dramatically reduce the manual work devoted to the construction of ML pipelines. Such frameworks intelligently search among millions of possible ML pipelines - typically containing feature engineering, model selection, and hyperparameter tuning steps - and finally output an optimal pipeline in terms of predictive accuracy. However, when the dataset is large, each individual configuration takes longer to execute, and therefore the overall AutoML running time becomes increasingly high. To this end, we present SubStrat, an AutoML optimization strategy that tackles the data size rather than the configuration space. It wraps existing AutoML tools, and instead of executing them directly on the entire dataset, SubStrat uses a genetic-based algorithm to find a small yet representative data subset which preserves a particular characteristic of the full data. It then employs the AutoML tool on the small subset, and finally, it refines the resulting pipeline by executing a restricted, much shorter, AutoML process on the large dataset. Our experimental results on two popular AutoML frameworks, Auto-Sklearn and TPOT, show that SubStrat reduces their running times by 79% (on average), with less than 2% average loss in the accuracy of the resulting ML pipeline.
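The core of the strategy described above is the genetic search for a small subset of rows that preserves some characteristic of the full data. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: it evolves candidate row-index subsets toward one whose per-column means match the full dataset's (per-column means stand in for whatever "particular characteristic" the paper preserves). All function names and parameters here are assumptions for illustration.

```python
import numpy as np

def fitness(subset_idx, data, full_stat):
    # Score a candidate subset by how closely it preserves a summary
    # statistic of the full data. Column means are a stand-in assumption
    # for the paper's preserved "characteristic"; higher is better.
    sub_stat = data[subset_idx].mean(axis=0)
    return -np.abs(sub_stat - full_stat).sum()

def genetic_subset(data, subset_size, pop_size=20, generations=30, rng=None):
    """Evolve a population of row-index subsets toward one whose
    column means match the full dataset's (a toy preservation target)."""
    rng = rng or np.random.default_rng(0)
    n = len(data)
    full_stat = data.mean(axis=0)
    # Initial population: random subsets of row indices, no repeats.
    pop = [rng.choice(n, size=subset_size, replace=False)
           for _ in range(pop_size)]
    for _ in range(generations):
        # Elitist selection: keep the better half of the population.
        scored = sorted(pop, key=lambda s: fitness(s, data, full_stat),
                        reverse=True)
        elite = scored[: pop_size // 2]
        children = []
        for parent in elite:
            child = parent.copy()
            # Mutation: swap one selected row for a random unselected one.
            out = rng.integers(subset_size)
            candidates = np.setdiff1d(np.arange(n), child)
            child[out] = rng.choice(candidates)
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda s: fitness(s, data, full_stat))

# The wrapper then runs the AutoML tool on data[best_subset], and finally
# launches a restricted, much shorter AutoML run on the full data to
# refine the resulting pipeline (both steps omitted in this sketch).
```

In the full strategy, the subset search is cheap relative to an AutoML run, so its cost is amortized by the much faster AutoML search it enables on the reduced data.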
