Paper Title

Transliteration of Judeo-Arabic Texts into Arabic Script Using Recurrent Neural Networks

Paper Authors

Ori Terner, Kfir Bar, Nachum Dershowitz

Paper Abstract

We trained a model to automatically transliterate Judeo-Arabic texts into Arabic script, enabling Arabic readers to access those writings. We employ a recurrent neural network (RNN) combined with the connectionist temporal classification (CTC) loss to handle unequal input/output lengths. This necessitates adjusting the training data to avoid input sequences that are shorter than their corresponding outputs. We also use a pretraining stage with a different loss function to improve network convergence. Since only a single source of parallel text was available for training, we take advantage of the possibility of generating data synthetically. We train a model that is able to memorize words in the output language and that also uses context to resolve ambiguities in the transliteration. We improve on the baseline's 9.5% character error rate, achieving 2% error with our best configuration. To measure the contribution of context to learning, we also tested word-shuffled data, for which the error rises to 2.5%.
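The abstract describes a character-level recurrent network trained with the CTC loss, which allows input and output sequences of different lengths provided each input is at least as long as its target. The following is a minimal sketch of that kind of setup, assuming PyTorch; the architecture, hyperparameters, and alphabet sizes are illustrative assumptions and not the authors' actual implementation.

```python
# Minimal sketch (not the authors' code): a character-level BiLSTM with CTC loss
# for transliteration between scripts of different lengths. Vocabulary sizes,
# layer sizes, and sequence lengths below are placeholder assumptions.
import torch
import torch.nn as nn

class TransliterationRNN(nn.Module):
    def __init__(self, in_vocab, out_vocab, emb=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(in_vocab, emb)
        self.rnn = nn.LSTM(emb, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        # One extra output class for the CTC blank symbol (index 0 by convention).
        self.proj = nn.Linear(2 * hidden, out_vocab + 1)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))            # (batch, time, 2*hidden)
        return self.proj(h).log_softmax(dim=-1)   # per-step class log-probs

model = TransliterationRNN(in_vocab=40, out_vocab=45)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Toy batch: 8 input sequences of length 60, target sequences of length 20.
# CTC requires inputs at least as long as their targets, which is why the
# paper adjusts its training data.
x = torch.randint(1, 40, (8, 60))
targets = torch.randint(1, 46, (8, 20))
log_probs = model(x).transpose(0, 1)              # (time, batch, classes) for CTCLoss
input_lens = torch.full((8,), 60, dtype=torch.long)
target_lens = torch.full((8,), 20, dtype=torch.long)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()
```

The pretraining stage with a different loss function and the synthetic-data generation mentioned in the abstract are not shown here; this sketch only illustrates the core RNN-plus-CTC training step.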
