2kenize: Tying Subword Sequences for Chinese Script Conversion

Paper Title

2kenize: Tying Subword Sequences for Chinese Script Conversion

Authors

Pranav A, Isabelle Augenstein

Abstract

Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a single simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert a pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities.
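To make the one-to-many problem in the abstract concrete, here is a minimal sketch in Python. It is not the 2kenize model: it leaves out subword segmentation and uses a hand-written score table where the paper uses language models. The mapping table, the scores, and the function names are all assumptions made for illustration. The simplified character 发 corresponds to both 發 (as in 發展, "develop") and 髮 (as in 頭髮, "hair"), so a converter has to pick one from context.

```python
from itertools import product

# Toy one-to-many table (illustrative only, not the paper's mapping data).
S2T = {
    "头": ["頭"],
    "发": ["發", "髮"],
    "展": ["展"],
}

# Hypothetical scores standing in for a traditional-script language model.
TOY_BIGRAM_SCORE = {
    ("頭", "髮"): 2.0,  # 頭髮 "hair"
    ("發", "展"): 2.0,  # 發展 "develop"
}


def candidates(text):
    """Enumerate every traditional rendering the table allows."""
    # Characters missing from the table pass through unchanged.
    options = [S2T.get(ch, [ch]) for ch in text]
    return ["".join(chars) for chars in product(*options)]


def score(text):
    """Sum the toy scores of adjacent character pairs."""
    return sum(TOY_BIGRAM_SCORE.get(pair, 0.0) for pair in zip(text, text[1:]))


def convert(text):
    """Pick the highest-scoring candidate rendering."""
    return max(candidates(text), key=score)


print(convert("头发"))  # 頭髮
print(convert("发展"))  # 發展
```

A context-free table lookup would map 发 to the same traditional character in both words. Scoring whole candidate sequences is what lets the sketch choose differently in each case. Enumerating every candidate grows exponentially with the number of ambiguous characters, so this brute-force search is only workable for short strings.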
