Paper Title
Investigating Glyph Phonetic Information for Chinese Spell Checking: What Works and What's Next
Authors
Abstract
While pre-trained Chinese language models have demonstrated impressive performance on a wide range of NLP tasks, the Chinese Spell Checking (CSC) task remains a challenge. Previous research has explored using information such as glyphs and phonetics to improve the ability to distinguish misspelled characters, with good results. However, the generalization ability of these models is not well understood: it is unclear whether they incorporate glyph-phonetic information and, if so, whether this information is fully utilized. In this paper, we aim to better understand the role of glyph-phonetic information in the CSC task and suggest directions for improvement. Additionally, we propose a new, more challenging, and practical setting for testing the generalizability of CSC models. All code is made publicly available.