Paper Title

Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation

Authors

Xuanli He, Gholamreza Haffari, Mohammad Norouzi

Abstract

This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units. We view the subword segmentation of output sentences as a latent variable that should be marginalized out for learning and inference. A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations with maximum posterior probability. DPE uses a lightweight mixed character-subword transformer as a means of pre-processing parallel data to segment output sentences using dynamic programming. Empirical results on machine translation suggest that DPE is effective for segmenting output sentences and can be combined with BPE dropout for stochastic segmentation of source sentences. DPE achieves an average improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several WMT datasets including English <=> (German, Romanian, Estonian, Finnish, Hungarian).
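
To make the two dynamic programs mentioned in the abstract concrete, here is a minimal sketch of exact marginalization over segmentations and of MAP segmentation for a single string. It is not the authors' implementation: the paper scores subwords with a mixed character-subword transformer, whereas this toy assumes a fixed, context-independent table of subword probabilities. The names `segment`, `logsumexp`, and `toy_probs`, and all probability values, are hypothetical and chosen only for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def segment(text, log_prob):
    """Return (log marginal likelihood, MAP segmentation) of `text`.

    ASSUMPTION: `log_prob` maps each allowed subword to a fixed log
    probability. The paper's model conditions on context instead.
    """
    n = len(text)
    alpha = [-math.inf] * (n + 1)  # log-sum over all segmentations of text[:k]
    best = [-math.inf] * (n + 1)   # best single-segmentation score of text[:k]
    back = [0] * (n + 1)           # start index of the last subword on the best path
    alpha[0] = best[0] = 0.0

    for k in range(1, n + 1):
        scores = []
        for j in range(k):
            piece = text[j:k]
            if piece not in log_prob:
                continue
            lp = log_prob[piece]
            scores.append(alpha[j] + lp)      # sum: marginal likelihood
            if best[j] + lp > best[k]:        # max: MAP segmentation
                best[k] = best[j] + lp
                back[k] = j
        if scores:
            alpha[k] = logsumexp(scores)

    if best[n] == -math.inf:
        return alpha[n], None  # no segmentation is possible with this vocabulary

    # Follow the back-pointers from the end of the string to recover the MAP path.
    pieces = []
    k = n
    while k > 0:
        pieces.append(text[back[k]:k])
        k = back[k]
    return alpha[n], pieces[::-1]


# Hypothetical toy vocabulary with made-up probabilities.
toy_probs = {"u": 0.1, "n": 0.1, "d": 0.1, "o": 0.1,
             "un": 0.2, "do": 0.3, "undo": 0.05}
toy_log_prob = {piece: math.log(p) for piece, p in toy_probs.items()}

log_marginal, best = segment("undo", toy_log_prob)
print(math.exp(log_marginal), best)
```

Both quantities come from the same left-to-right pass over prefixes: replacing the sum with a max, and keeping back-pointers, turns the marginal likelihood computation into the search for the highest-scoring segmentation.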
