Paper Title

A Gradient-based Bilevel Optimization Approach for Tuning Hyperparameters in Machine Learning

Paper Authors

Sinha, Ankur; Khandait, Tanmay; Mohanty, Raja

Paper Abstract

Hyperparameter tuning is an active area of research in machine learning, where the aim is to identify the optimal hyperparameters that provide the best performance on the validation set. Hyperparameter tuning is often carried out with naive techniques, such as random search and grid search. However, these methods seldom lead to an optimal set of hyperparameters and often become very expensive. In this paper, we propose a bilevel solution method for the hyperparameter optimization problem that does not suffer from the drawbacks of the earlier studies. The proposed method is general and can be easily applied to any class of machine learning algorithms. The idea is based on approximating the lower-level optimal value function mapping, an important mapping in bilevel optimization that helps reduce the bilevel problem to a single-level constrained optimization task. The single-level constrained optimization problem is solved using the augmented Lagrangian method. We discuss the theory behind the proposed algorithm and perform an extensive computational study on two datasets that confirms the efficiency of the proposed method. A comparative study against grid search, random search and Bayesian optimization techniques shows that the proposed algorithm is multiple times faster on problems with one or two hyperparameters, and the computational gain is expected to be significantly higher as the number of hyperparameters increases. For a given hyperparameter, most techniques in the literature assume a unique optimal parameter set that minimizes the loss on the training set. Such an assumption is often violated by deep learning architectures; the proposed method does not require any such assumption.
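To make the formulation described in the abstract concrete, the sketch below writes out the bilevel problem and its value-function reduction in LaTeX. The notation is assumed for illustration and is not taken from the paper: \(\lambda\) denotes the hyperparameters, \(w\) the model parameters, \(F\) the validation loss, and \(f\) the training loss.

\[
\min_{\lambda,\, w} \; F(\lambda, w)
\quad \text{s.t.} \quad
w \in \operatorname*{arg\,min}_{w'} f(\lambda, w')
\]

Introducing the lower-level optimal value function \(\varphi(\lambda) = \min_{w'} f(\lambda, w')\), the bilevel problem reduces to the single-level constrained task

\[
\min_{\lambda,\, w} \; F(\lambda, w)
\quad \text{s.t.} \quad
f(\lambda, w) \le \varphi(\lambda).
\]

Since \(f(\lambda, w) \ge \varphi(\lambda)\) holds for every \(w\) by definition of \(\varphi\), the constraint is active at any feasible point, so it can be treated as an equality and handled with an augmented Lagrangian of the (assumed) form

\[
L_\rho(\lambda, w, \mu) = F(\lambda, w) + \mu\,\big(f(\lambda, w) - \varphi(\lambda)\big) + \tfrac{\rho}{2}\,\big(f(\lambda, w) - \varphi(\lambda)\big)^2,
\]

minimized over \((\lambda, w)\) with standard multiplier and penalty updates; in practice, \(\varphi\) is replaced by the approximation the paper constructs.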
