Paper Title
AdaX: Adaptive Gradient Descent with Exponential Long Term Memory
Paper Authors
Paper Abstract
Although adaptive optimization algorithms such as Adam show fast convergence in many machine learning tasks, this paper identifies a problem with Adam by analyzing its performance on a simple non-convex synthetic problem, showing that Adam's fast convergence can lead the algorithm to local minima. To address this problem, we improve Adam by proposing a novel adaptive gradient descent algorithm named AdaX. Unlike Adam, which ignores past gradients, AdaX exponentially accumulates long-term gradient information during training to adaptively tune the learning rate. We thoroughly prove the convergence of AdaX in both the convex and non-convex settings. Extensive experiments show that AdaX outperforms Adam on various tasks in computer vision and natural language processing and can catch up with Stochastic Gradient Descent.
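The abstract does not spell out AdaX's exact update rule, so the following is only a minimal illustrative sketch of the idea it describes: Adam's second-moment estimate is an exponential moving average that forgets old gradients, whereas a "long-term memory" variant keeps accumulating past squared gradients with a growing total weight. All names here (adam_like_step, long_term_memory_step, lr, beta1, beta2, eps) are hypothetical and chosen to mirror Adam's conventions; this is an assumed reading of "exponential long-term memory", not the paper's verified algorithm.

```python
import numpy as np

def adam_like_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style step: v is an exponential moving average, so old gradients fade out."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # standard bias corrections
    v_hat = v / (1 - beta2**t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def long_term_memory_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=1e-4, eps=1e-8):
    """Assumed long-term-memory step: v accumulates all past squared gradients,
    with the total weight on history growing like (1 + beta2)**t instead of decaying."""
    m = beta1 * m + (1 - beta1) * grad
    v = (1 + beta2) * v + beta2 * grad**2
    v_hat = v / ((1 + beta2)**t - 1)    # normalize by the accumulated weight
    return param - lr * m / (np.sqrt(v_hat) + eps), m, v

if __name__ == "__main__":
    # Toy demo: run the long-term-memory update on f(x) = x^2.
    x, m, v = 5.0, 0.0, 0.0
    for t in range(1, 501):
        grad = 2.0 * x
        x, m, v = long_term_memory_step(x, grad, m, v, t, lr=0.1)
    print(f"x after 500 steps: {x:.4f}")
```

Because the weight on history grows rather than decays, the adaptive denominator in the sketch reflects the whole training trajectory instead of only the most recent gradients, which is the behavior the abstract attributes to AdaX.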