Correcting diacritics and typos with a ByT5 transformer model

Title

Correcting diacritics and typos with a ByT5 transformer model

Authors

Lukas Stankevičius, Mantas Lukoševičius, Jurgita Kapočiūtė-Dzikienė, Monika Briedienė, Tomas Krilavičius

Abstract

Due to the fast pace of life and online communications and the prevalence of English and the QWERTY keyboard, people tend to forgo diacritics and make typographical errors (typos) when typing in other languages. Restoring diacritics and correcting spelling is important for proper language use and for disambiguating texts, both for human readers and for downstream algorithms. However, the two problems are typically addressed separately: state-of-the-art diacritics restoration methods do not tolerate other typos, while classical spellcheckers cannot deal adequately with text in which all diacritics are missing. In this work, we tackle both problems at once by employing the newly developed universal ByT5 byte-level seq2seq transformer model, which requires no language-specific model structures. For comparison, we perform diacritics restoration on benchmark datasets of 12 languages, with the addition of Lithuanian. The experimental investigation shows that our approach achieves results (> 98%) comparable to the previous state of the art, despite being trained less and on less data. Our approach is also able to restore diacritics in words not seen during training with > 76% accuracy. Our simultaneous diacritics restoration and typo correction approach reaches > 94% alpha-word accuracy on the 13 languages. It has no direct competitors and strongly outperforms classical spell-checking and dictionary-based approaches. We also show that all accuracies improve further with more training. Taken together, this shows the great real-world application potential of our suggested methods for more data, languages, and error classes.
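
To make the setup in the abstract concrete, the following is a minimal sketch of how an undiacritized input might be derived from clean text, and of what a byte-level model sees instead of word or subword tokens. It is an illustration under stated assumptions, not the authors' pipeline: the helper names `strip_diacritics` and `to_bytes` are hypothetical, Unicode NFD decomposition is assumed as the stripping method, and the actual ByT5 tokenizer's id mapping is not reproduced here.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after NFD decomposition (hypothetical helper).

    Assumption: letters that have no canonical decomposition (for example
    Polish "ł") are left unchanged and would need an explicit mapping.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def to_bytes(text: str) -> list:
    """Return the UTF-8 byte values of the text (hypothetical helper)."""
    return list(text.encode("utf-8"))


target = "ačiū"                      # clean Lithuanian text: the desired output
source = strip_diacritics(target)    # simulated input without diacritics

print(source)                                        # aciu
print(len(to_bytes(source)), len(to_bytes(target)))  # 4 6
```

In this sketch the pair (`source`, `target`) plays the role of one training example for a sequence-to-sequence model. Because the diacritized form is longer in bytes than its stripped form, input and output lengths differ, which is one reason a seq2seq formulation is a natural fit for the task.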
