Paper Title
Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition
Paper Authors
Paper Abstract
Existing work on Chinese text error detection mainly focuses on spelling and simple grammatical errors. These errors have been studied extensively and are relatively easy for humans to spot. In contrast, Chinese semantic errors are understudied and more complex, such that even humans cannot easily recognize them. This paper addresses Chinese Semantic Error Recognition (CSER), a binary classification task that determines whether a sentence contains a semantic error. Current research offers no effective method for this task. In this paper, we inherit the model structure of BERT and design several syntax-related pre-training tasks so that the model can learn syntactic knowledge. Our pre-training tasks consider both the directionality of the dependency structure and the diversity of dependency relations. Due to the lack of a published dataset for CSER, we build the first high-quality dataset for the task, named the Corpus of Chinese Linguistic Semantic Acceptability (CoCLSA). Experimental results on CoCLSA show that our methods outperform universal pre-trained models and syntax-infused models.
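The abstract does not specify how the dependency-based pre-training targets are constructed. As a purely illustrative sketch (the function name, label scheme, and example parse are our assumptions, not the paper's actual formulation), the two properties mentioned — arc directionality and relation diversity — could be encoded as token-pair targets like this:

```python
# Hypothetical sketch: derive pre-training targets from a dependency parse.
# The parse is given as (head_index, relation) per token; head_index -1 = root.
# Directionality: for each dependent/head pair (i, j) with i < j, record
# which side is the head. Diversity: also keep the relation label itself,
# so the model must predict the relation type, not just arc existence.

def build_dependency_targets(heads, relations):
    """Return {(i, j): (direction, relation)} for each head/dependent pair.

    direction is expressed from the lower index i: "i->j" means the head
    is at i, "j->i" means the head is at j.
    """
    targets = {}
    for dep, (head, rel) in enumerate(zip(heads, relations)):
        if head == -1:  # the root token has no incoming arc
            continue
        i, j = sorted((head, dep))
        direction = "i->j" if head == i else "j->i"
        targets[(i, j)] = (direction, rel)
    return targets

# Toy example: a three-token sentence whose verb (index 1) is the root,
# with a subject at index 0 and an object at index 2.
heads = [1, -1, 1]
relations = ["nsubj", "root", "dobj"]
print(build_dependency_targets(heads, relations))
# → {(0, 1): ('j->i', 'nsubj'), (1, 2): ('i->j', 'dobj')}
```

A classification head over BERT's token-pair representations could then be trained against these direction and relation labels, alongside the usual masked-language-modeling objective.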