Paper Title

Subword-Delimited Downsampling for Better Character-Level Translation

Paper Authors

Lukas Edman, Antonio Toral, Gertjan van Noord

Paper Abstract

Subword-level models have been the dominant paradigm in NLP. However, character-level models have the benefit of seeing each character individually, providing the model with more detailed information that ultimately could lead to better models. Recent works have shown character-level models to be competitive with subword models, but costly in terms of time and computation. Character-level models with a downsampling component alleviate this, but at the cost of quality, particularly for machine translation. This work analyzes the problems of previous downsampling methods and introduces a novel downsampling method which is informed by subwords. This new downsampling method not only outperforms existing downsampling methods, showing that downsampling characters can be done without sacrificing quality, but also leads to promising performance compared to subword models for translation.
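
To make the core idea of the abstract concrete, below is a minimal conceptual sketch of subword-delimited downsampling: character-level representations are pooled within subword boundaries, so the downsampled sequence length equals the number of subwords. This is not the authors' implementation; the mean-pooling choice, function names, and dimensions are illustrative assumptions.

```python
# Conceptual sketch (assumption, not the paper's code): pool character
# vectors inside each subword span so the sequence shrinks from
# num_chars to num_subwords.
import numpy as np

def subword_delimited_downsample(char_vectors: np.ndarray,
                                 subword_lengths: list[int]) -> np.ndarray:
    """Mean-pool character vectors within each subword span.

    char_vectors:    (num_chars, dim) array of character representations.
    subword_lengths: number of characters per subword; must sum to num_chars.
    Returns an array of shape (num_subwords, dim).
    """
    assert sum(subword_lengths) == char_vectors.shape[0]
    pooled, start = [], 0
    for length in subword_lengths:
        pooled.append(char_vectors[start:start + length].mean(axis=0))
        start += length
    return np.stack(pooled)

# Example: "unhappiness" (11 characters) split into subwords
# ["un", "happi", "ness"], giving 3 downsampled positions.
chars = np.random.rand(11, 8)   # toy character embeddings, dim 8
subwords = [2, 5, 4]            # character counts per subword
print(subword_delimited_downsample(chars, subwords).shape)  # (3, 8)
```

In this sketch, the downsampling boundaries come from a subword tokenizer rather than a fixed stride, which is the key difference from fixed-rate downsampling methods the abstract contrasts against.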
