Paper Title

AdaX: Adaptive Gradient Descent with Exponential Long Term Memory

Paper Authors

Wenjie Li, Zhaoyang Zhang, Xinjiang Wang, Ping Luo

Paper Abstract

Although adaptive optimization algorithms such as Adam show fast convergence in many machine learning tasks, this paper identifies a problem of Adam by analyzing its performance in a simple non-convex synthetic problem, showing that Adam's fast convergence would possibly lead the algorithm to local minimums. To address this problem, we improve Adam by proposing a novel adaptive gradient descent algorithm named AdaX. Unlike Adam that ignores the past gradients, AdaX exponentially accumulates the long-term gradient information in the past during training, to adaptively tune the learning rate. We thoroughly prove the convergence of AdaX in both the convex and non-convex settings. Extensive experiments show that AdaX outperforms Adam in various tasks of computer vision and natural language processing and can catch up with Stochastic Gradient Descent.
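
The core idea stated in the abstract is that AdaX replaces Adam's exponentially decaying average of squared gradients with an exponentially growing accumulator, so that gradient information from the distant past is never forgotten. Below is a minimal Python sketch of that idea; the specific accumulation rule v_t = (1 + beta2) * v_{t-1} + beta2 * g_t^2, the bias correction, and the default hyperparameters are illustrative assumptions, not necessarily the exact algorithm from the paper.

```python
import math

def adax_like_step(param, grad, state, lr=0.005, beta1=0.9, beta2=1e-4, eps=1e-8):
    """One AdaX-style update on a scalar parameter (illustrative sketch only)."""
    state["t"] += 1
    t = state["t"]

    # First moment: the usual Adam-style exponential moving average of gradients.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad

    # Second moment: exponential *accumulation* rather than a decaying average,
    # so contributions from early gradients are retained (long-term memory).
    state["v"] = (1.0 + beta2) * state["v"] + beta2 * grad * grad

    # Normalize the growing accumulator so its scale matches a squared gradient.
    v_hat = state["v"] / ((1.0 + beta2) ** t - 1.0)

    return param - lr * state["m"] / (math.sqrt(v_hat) + eps)


# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
state = {"t": 0, "m": 0.0, "v": 0.0}
x = 1.0
for _ in range(400):
    x = adax_like_step(x, 2.0 * x, state)
print(x)  # moves toward the minimum at 0
```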
