Paper Title
AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients
Paper Authors
Paper Abstract
Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) and accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks (CNNs), adaptive methods typically converge faster but generalize worse compared to SGD; for complex settings such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability. We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the stepsize according to the "belief" in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step. We validate AdaBelief in extensive experiments, showing that it outperforms other methods with fast convergence and high accuracy on image classification and language modeling. Specifically, on ImageNet, AdaBelief achieves comparable accuracy to SGD. Furthermore, in the training of a GAN on Cifar10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer. Code is available at https://github.com/juntang-zhuang/Adabelief-Optimizer
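The "belief"-based stepsize described in the abstract can be sketched as a small Adam-style update in which the second-moment EMA tracks how far each observed gradient deviates from its EMA prediction. The following NumPy snippet is a minimal illustration under that reading, with assumed hyperparameter names and defaults (lr, beta1, beta2, eps); the reference implementation is the one at the linked repository.

```python
import numpy as np

def adabelief_step(theta, grad, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative AdaBelief-style update (sketch, not the official code).

    m: EMA of gradients, used as the prediction of the next gradient.
    s: EMA of the squared deviation (grad - m)**2, i.e. how far the observed
       gradient strays from the prediction. A large deviation (low "belief")
       inflates the denominator and shrinks the step; a small deviation
       (high "belief") yields a large step.
    t: step counter starting at 1, used for Adam-style bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2
    # Bias-corrected estimates, as in Adam.
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s
```

The only departure from Adam in this sketch is that the second-moment EMA accumulates (grad - m)**2 instead of grad**2, so a gradient that matches its EMA prediction produces a small denominator and hence a large, trusted step.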