Paper Title

A Simple Guard for Learned Optimizers

Paper Authors

Isabeau Prémont-Schwarz, Jaroslav Vítků, Jan Feyereisl

Paper Abstract

If the trend of learned components eventually outperforming their hand-crafted versions continues, learned optimizers will eventually outperform hand-crafted optimizers like SGD or Adam. However, even if learned optimizers (L2Os) eventually outpace hand-crafted ones in practice, they are still not provably convergent and might fail out of distribution. These are the issues addressed here. Currently, learned optimizers frequently outperform generic hand-crafted optimizers (such as gradient descent) at the beginning of learning, but they generally plateau after some time while the generic algorithms continue to make progress and often overtake the learned algorithm, like Aesop's tortoise overtaking the hare. L2Os also still have a difficult time generalizing out of distribution. Heaton et al. proposed Safeguarded L2O (GL2O), which can take a learned optimizer and safeguard it with a generic learning algorithm so that, by conditionally switching between the two, the resulting algorithm is provably convergent. We propose a new class of Safeguarded L2O, called Loss-Guarded L2O (LGL2O), which is both conceptually simpler and computationally less expensive. The guarding mechanism decides solely based on the expected future loss value of both optimizers. Furthermore, we give a theoretical proof of LGL2O's convergence guarantee and empirical results comparing it to GL2O and other baselines, showing that it combines the best of L2O and SGD and that in practice it converges much better than GL2O.
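To make the loss-guarded switching idea concrete, below is a minimal, hypothetical Python sketch based only on the description above, not the authors' implementation. The helper names (`l2o_step`, `loss_fn`, `guarded_step`) and the vanilla-SGD fallback are illustrative assumptions: at each step both optimizers propose an update, and whichever candidate yields the lower loss is kept.

```python
# Illustrative sketch of loss-guarded switching between a learned optimizer
# and a generic fallback (SGD). All names here are assumptions for illustration.
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # Generic fallback optimizer: one step of plain gradient descent.
    return theta - lr * grad

def guarded_step(theta, grad, loss_fn, l2o_step):
    # Candidate updates proposed by the learned optimizer and by SGD.
    theta_l2o = l2o_step(theta, grad)
    theta_sgd = sgd_step(theta, grad)
    # The guard keeps whichever candidate has the lower loss, so the learned
    # optimizer is only followed while it keeps making progress.
    if loss_fn(theta_l2o) <= loss_fn(theta_sgd):
        return theta_l2o
    return theta_sgd

if __name__ == "__main__":
    # Toy usage on a quadratic objective with a stand-in "learned" optimizer
    # (an aggressive fixed-step update).
    loss_fn = lambda th: float(np.sum(th ** 2))
    grad_fn = lambda th: 2.0 * th
    l2o_step = lambda th, g: th - 0.2 * g
    theta = np.array([5.0, -3.0])
    for _ in range(100):
        theta = guarded_step(theta, grad_fn(theta), loss_fn, l2o_step)
    print(loss_fn(theta))
```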
