Paper Title

Robust Neural Machine Translation: Modeling Orthographic and Interpunctual Variation

Paper Authors

Toms Bergmanis, Artūrs Stafanovičs, Mārcis Pinnis

Paper Abstract

Neural machine translation systems typically are trained on curated corpora and break when faced with non-standard orthography or punctuation. Resilience to spelling mistakes and typos, however, is crucial as machine translation systems are used to translate texts of informal origins, such as chat conversations, social media posts and web pages. We propose a simple generative noise model to generate adversarial examples of ten different types. We use these to augment machine translation systems' training data and show that, when tested on noisy data, systems trained using adversarial examples perform almost as well as when translating clean data, while baseline systems' performance drops by 2-3 BLEU points. To measure the robustness and noise invariance of machine translation systems' outputs, we use the average translation edit rate between the translation of the original sentence and its noised variants. Using this measure, we show that systems trained on adversarial examples on average yield 50% consistency improvements when compared to baselines trained on clean data.
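The abstract does not enumerate the ten noise types or give the exact noise model, so the following is only a minimal sketch of the general approach: character-level perturbations (transposition, deletion, punctuation stripping) applied to training sentences, plus a word-level edit rate between the translation of a clean sentence and the translation of its noised variant as a rough stand-in for the paper's TER-based consistency measure (real TER also allows phrase shifts). All function names and probabilities here are illustrative, not taken from the paper.

```python
import random

def swap_chars(word):
    # Transpose two adjacent characters (a common typo).
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def drop_char(word):
    # Delete one character at a random position.
    if len(word) < 2:
        return word
    i = random.randrange(len(word))
    return word[:i] + word[i + 1:]

def strip_punctuation(sentence):
    # Remove sentence-final punctuation (one kind of interpunctual noise).
    return sentence.rstrip(".!?")

def noise_sentence(sentence, p=0.1):
    # Perturb each word with probability p using a random noise operation.
    words = []
    for w in sentence.split():
        if random.random() < p:
            w = random.choice([swap_chars, drop_char])(w)
        words.append(w)
    return " ".join(words)

def edit_rate(hyp, ref):
    # Word-level Levenshtein distance normalized by reference length;
    # an approximation of TER that ignores phrase shifts.
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[len(h)][len(r)] / max(len(r), 1)
```

In the paper's setup, `noise_sentence` would be applied to source sentences to augment the training data, and consistency would be the average `edit_rate` between a system's translation of each original sentence and its translations of the noised variants (lower is more noise-invariant).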
