Paper Title
Data Scaling Laws in NMT: The Effect of Noise and Architecture
Paper Authors
Paper Abstract
In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand how they impact the data scaling laws. In particular, we change the following: (1) Architecture and task setup: we compare against a transformer-LSTM hybrid and a decoder-only transformer with a language modeling loss; (2) Noise level in the training distribution: we experiment with filtering and with adding i.i.d. synthetic noise. In all the above cases, we find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data. Lastly, we find that using back-translated data instead of parallel data can significantly degrade the scaling exponent.
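As a point of reference for the power-law behavior described in the abstract, a commonly used functional form for such data scaling fits is sketched below in LaTeX. The symbols B, \alpha, and L_\infty are illustrative assumptions, since the abstract does not state the paper's exact parameterization: D denotes the number of training samples, \alpha the data scaling exponent, and L_\infty an irreducible loss floor.

\[
  L(D) \;=\; B \, D^{-\alpha} \;+\; L_{\infty}
\]

Under a form like this, the abstract's findings can be read as \alpha being largely insensitive to the architecture changes and i.i.d. synthetic noise studied, while training on back-translated data in place of parallel data reduces \alpha.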