Paper Title

Understanding AdamW through Proximal Methods and Scale-Freeness

Paper Authors

Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, Francesco Orabona

Paper Abstract

Adam has been widely adopted for training deep neural networks due to less hyperparameter tuning and remarkable performance. To improve generalization, Adam is typically used in tandem with a squared $\ell_2$ regularizer (referred to as Adam-$\ell_2$). However, even better performance can be obtained with AdamW, which decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$. Yet, we are still lacking a complete explanation of the advantages of AdamW. In this paper, we tackle this question from both an optimization and an empirical point of view. First, we show how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-$\ell_2$. Next, we consider the property of "scale-freeness" enjoyed by AdamW and by its proximal counterpart: their updates are invariant to component-wise rescaling of the gradients. We provide empirical evidence across a wide range of deep learning experiments showing a correlation between the problems in which AdamW exhibits an advantage over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales, thus motivating the hypothesis that the advantage of AdamW could be due to the scale-free updates.
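
The abstract contrasts Adam-$\ell_2$, which folds the regularizer's gradient into Adam's update, with AdamW, which applies the weight decay directly to the parameters, and points to scale-freeness (invariance to component-wise rescaling of the gradients) as a possible explanation for AdamW's advantage. The minimal NumPy sketch below is not from the paper: the function `adam_like_step` and all numeric values are illustrative assumptions. It shows the two variants side by side and numerically checks that, with $\epsilon = 0$, the AdamW-style update is unchanged by a per-coordinate rescaling of the gradient while the Adam-$\ell_2$-style update is not.

```python
import numpy as np

def adam_like_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.0, decoupled=False):
    """One parameter update. decoupled=False mimics Adam-l2 (the regularizer's
    gradient is folded into the Adam update); decoupled=True mimics AdamW
    (the weight decay is applied directly to the parameters)."""
    if not decoupled:
        grad = grad + weight_decay * x          # Adam-l2: decay enters the gradient
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        step = step + lr * weight_decay * x     # AdamW: decoupled weight decay
    return x - step, m, v

# Scale-freeness check (eps = 0): multiply each gradient coordinate by a
# different positive constant and compare the resulting updates.
x0    = np.array([ 1.0, -2.0,  0.5,  3.0, -1.0])
g     = np.array([-1.0,  1.0,  2.0, -0.5,  0.3])
scale = np.array([1e-3,  1.0, 10.0,  1e3,  1e5])   # per-coordinate rescaling
zeros = np.zeros(5)

# AdamW-style update: unchanged by the rescaling.
w1, _, _ = adam_like_step(x0, g,         zeros, zeros, t=1, eps=0.0,
                          weight_decay=0.1, decoupled=True)
w2, _, _ = adam_like_step(x0, scale * g, zeros, zeros, t=1, eps=0.0,
                          weight_decay=0.1, decoupled=True)
print(np.allclose(w1, w2))   # True

# Adam-l2-style update: folding the decay into the gradient breaks the invariance.
a1, _, _ = adam_like_step(x0, g,         zeros, zeros, t=1, eps=0.0,
                          weight_decay=0.1, decoupled=False)
a2, _, _ = adam_like_step(x0, scale * g, zeros, zeros, t=1, eps=0.0,
                          weight_decay=0.1, decoupled=False)
print(np.allclose(a1, a2))   # False: the first coordinate's sign flips
```

In this toy setup the rescaling leaves the AdamW-style step untouched because the per-coordinate normalization by $\sqrt{\hat{v}}$ cancels the scale, whereas adding $\lambda x$ to the gradient before normalization (the Adam-$\ell_2$ route) mixes the two scales and changes the step.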
