Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure

Paper Title

Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure

Authors

Yifan Zhang, Maohua Wang, Yongjian Huang, Qianrong Gu

Abstract

Recent work on segmentation-free word embedding (sembei) developed a new word-embedding pipeline for unsegmented languages that avoids segmentation as a preprocessing step. However, the embedding vocabulary contains too many noisy n-grams whose characters are only weakly associated, which limits the quality of the learned word embeddings. To address this problem, a new version of the segmentation-free word embedding model is proposed that collects its n-gram vocabulary via a novel unsupervised association measure called pointwise association with times information (PATI). Compared with commonly used n-gram filtering methods, such as the frequency criterion used in sembei and pointwise mutual information (PMI), the proposed method leverages more latent information from the corpus and is thus able to collect more valid n-grams with stronger cohesion as embedding targets in unsegmented language data such as Chinese text. Further experiments on Chinese SNS data show that the proposed model improves the performance of word embeddings in downstream tasks.
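
To make the two baseline filters named in the abstract concrete, the following is a minimal sketch of frequency filtering and PMI filtering over character bigrams. It is not the authors' code and does not implement PATI, whose definition the abstract does not give. The function names, the thresholds `min_count` and `min_pmi`, and the choice to estimate probabilities from raw relative counts are all illustrative assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return every contiguous character n-gram of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def score_bigrams(corpus):
    """Map each character bigram to (count, PMI).

    Assumption: probabilities are plain relative frequencies, and
    PMI(x, y) = log2(p(xy) / (p(x) * p(y))).
    """
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(char_ngrams(line, 1))
        bigrams.update(char_ngrams(line, 2))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for bigram, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[bigram[0]] / n_uni
        p_y = unigrams[bigram[1]] / n_uni
        scores[bigram] = (count, math.log2(p_xy / (p_x * p_y)))
    return scores


def filter_vocab(corpus, min_count=2, min_pmi=0.0):
    """Keep bigrams that pass both a frequency and a PMI threshold.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    scores = score_bigrams(corpus)
    return sorted(
        bigram
        for bigram, (count, pmi) in scores.items()
        if count >= min_count and pmi >= min_pmi
    )
```

Calling `filter_vocab` on a list of unsegmented sentences returns the bigrams that survive both thresholds; a stricter association measure would be swapped in where the PMI score is computed.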
