Paper Title
Softmax Tempering for Training Neural Machine Translation Models
Paper Authors
Paper Abstract
Neural machine translation (NMT) models are typically trained using a softmax cross-entropy loss, where the softmax distribution is compared against smoothed gold labels. In low-resource scenarios, NMT models tend to overfit because the softmax distribution quickly approaches the gold label distribution. To address this issue, we propose to divide the logits by a temperature coefficient, prior to applying softmax, during training. In our experiments on 11 language pairs in the Asian Language Treebank dataset and the WMT 2019 English-to-German translation task, we observed significant improvements in translation quality of up to 3.9 BLEU points. Furthermore, softmax tempering makes greedy search as good as beam search decoding in terms of translation quality, enabling a 1.5 to 3.5 times speed-up. We also study the impact of softmax tempering on multilingual NMT and recurrently stacked NMT, both of which aim to reduce the NMT model size by parameter sharing, thereby verifying the utility of temperature in developing compact NMT models. Finally, an analysis of softmax entropies and gradients reveals the impact of our method on the internal behavior of NMT models.
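The core operation described in the abstract — dividing the logits by a temperature coefficient before the softmax during training — can be sketched as follows. This is a minimal, framework-free illustration (the function name and pure-Python style are my own; the paper's models would apply this inside a standard NMT training loop):

```python
import math

def tempered_softmax(logits, temperature=1.0):
    """Softmax with tempering: logits are divided by a temperature
    coefficient before normalization. A temperature > 1 flattens the
    resulting distribution, slowing its approach to the (smoothed)
    gold label distribution and thereby reducing overfitting."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# With tempering, the peak probability is lower (distribution is flatter):
p_plain = tempered_softmax([2.0, 1.0, 0.1], temperature=1.0)
p_temp = tempered_softmax([2.0, 1.0, 0.1], temperature=2.0)
```

At inference time the temperature would be dropped (or set to 1), so tempering only changes how sharply the model's training-time distribution matches the gold labels.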