Paper Title

Dual Averaging is Surprisingly Effective for Deep Learning Optimization

Authors

Samy Jelassi, Aaron Defazio

Abstract

First-order stochastic optimization methods are currently the most widely used class of methods for training deep neural networks. However, the choice of the optimizer has become an ad-hoc rule that can significantly affect the performance. For instance, SGD with momentum (SGD+M) is typically used in computer vision (CV) and Adam is used for training transformer models for Natural Language Processing (NLP). Using the wrong method can lead to significant performance degradation. Inspired by the dual averaging algorithm, we propose Modernized Dual Averaging (MDA), an optimizer that is able to perform as well as SGD+M in CV and as Adam in NLP. Our method is not adaptive and is significantly simpler than Adam. We show that MDA induces a decaying uncentered $L_2$-regularization compared to vanilla SGD+M and hypothesize that this may explain why it works on NLP problems where SGD+M fails.
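For readers unfamiliar with the base algorithm, below is a minimal sketch of classical (Nesterov-style) dual averaging on an unconstrained problem. This is not the paper's MDA optimizer, whose exact modifications are not spelled out in this abstract; it only illustrates the structural difference from SGD that motivates the uncentered $L_2$-regularization interpretation: each iterate is rebuilt from the starting point and the running sum of all past gradients. The function name and the toy quadratic are illustrative assumptions.

```python
# Minimal sketch of classical dual averaging (NOT the paper's MDA).
# Illustrative only: function name and toy problem are assumptions.
import numpy as np

def dual_averaging(grad_fn, x0, n_steps=2000, gamma=1.0):
    """Classical dual averaging for an unconstrained problem.

    Each iterate is rebuilt from the starting point x0 and the running
    sum of all past gradients:
        x_{k+1} = x0 - (gamma / sqrt(k + 1)) * sum_{i <= k} g_i
    Equivalently, x_{k+1} minimizes the accumulated linear model plus a
    quadratic penalty ||x - x0||^2, i.e. an L2 term centered at x0 rather
    than at the current iterate -- the "uncentered" regularization the
    abstract refers to.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    grad_sum = np.zeros_like(x0)
    for k in range(n_steps):
        grad_sum += grad_fn(x)                       # accumulate all past gradients
        x = x0 - gamma / np.sqrt(k + 1) * grad_sum   # step taken from x0, not from x
    return x

# Toy usage: minimize f(x) = 0.5 * ||x - target||^2, whose gradient is x - target.
target = np.array([1.0, -2.0])
x_final = dual_averaging(lambda x: x - target, x0=np.zeros(2))
print(x_final)  # close to `target`
```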
