Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages

Paper title

Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages

Authors

Yiyuan Li, Antonios Anastasopoulos, Alan W. Black

Abstract

Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict and large corpora are usually required to collect enough examples. This work compares a neural model and character language models trained on varying amounts of target-language data. Our usage scenario is interactive correction that starts with almost no training examples and improves the models as more data is collected, for example within a chat app. Such models are designed to be improved incrementally as users give feedback. In this work, we design a system that embeds a knowledge base and a prediction model for spelling correction in low-resource languages. Experimental results on multiple languages show that the model can become effective with a small amount of data. We perform experiments on both natural and synthetic data, as well as on data from two endangered languages (Ainu and Griko). Finally, we built a prototype system that was used for a small case study on Hinglish, which further demonstrated the suitability of our approach in real-world scenarios.
