Paper Title
AdaX: Adaptive Gradient Descent with Exponential Long Term Memory
Paper Authors
Paper Abstract
Although adaptive optimization algorithms such as Adam show fast convergence in many machine learning tasks, this paper identifies a problem with Adam by analyzing its performance on a simple non-convex synthetic problem, showing that Adam's fast convergence can lead the algorithm to local minima. To address this problem, we improve Adam by proposing a novel adaptive gradient descent algorithm named AdaX. Unlike Adam, which ignores past gradients, AdaX exponentially accumulates long-term gradient information during training to adaptively tune the learning rate. We thoroughly prove the convergence of AdaX in both the convex and non-convex settings. Extensive experiments show that AdaX outperforms Adam on various tasks in computer vision and natural language processing and can catch up with Stochastic Gradient Descent.
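The abstract does not spell out AdaX's exact update rule, so the following is only a minimal illustrative sketch of the idea it describes: Adam's second-moment estimate is an exponential moving average that forgets old gradients, whereas a "long-term memory" variant keeps accumulating past squared gradients with a growing total weight. All names here (adam_like_step, long_term_memory_step, lr, beta1, beta2, eps) are hypothetical and chosen to mirror Adam's conventions; this is an assumed reading of "exponential long-term memory", not the paper's verified algorithm.

```python
import numpy as np

def adam_like_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style step: v is an exponential moving average, so old gradients fade out."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # standard bias corrections
    v_hat = v / (1 - beta2**t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def long_term_memory_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=1e-4, eps=1e-8):
    """Assumed long-term-memory step: v accumulates all past squared gradients,
    with the total weight on history growing like (1 + beta2)**t instead of decaying."""
    m = beta1 * m + (1 - beta1) * grad
    v = (1 + beta2) * v + beta2 * grad**2
    v_hat = v / ((1 + beta2)**t - 1)    # normalize by the accumulated weight
    return param - lr * m / (np.sqrt(v_hat) + eps), m, v

if __name__ == "__main__":
    # Toy demo: run the long-term-memory update on f(x) = x^2.
    x, m, v = 5.0, 0.0, 0.0
    for t in range(1, 501):
        grad = 2.0 * x
        x, m, v = long_term_memory_step(x, grad, m, v, t, lr=0.1)
    print(f"x after 500 steps: {x:.4f}")
```

Because the weight on history grows rather than decays, the adaptive denominator in the sketch reflects the whole training trajectory instead of only the most recent gradients, which is the behavior the abstract attributes to AdaX.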