Paper Title
Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning
Paper Authors
Abstract
Data-driven segmentation of words into subword units has been used in various natural language processing applications such as automatic speech recognition and statistical machine translation for almost 20 years. Recently it has become more widely adopted, as models based on deep neural networks often benefit from subword units even for morphologically simpler languages. In this paper, we discuss and compare training algorithms for a unigram subword model, based on the Expectation Maximization algorithm and lexicon pruning. Using English, Finnish, North Sami, and Turkish data sets, we show that this approach is able to find better solutions to the optimization problem defined by the Morfessor Baseline model than its original recursive training algorithm. The improved optimization also leads to higher morphological segmentation accuracy when compared to a linguistic gold standard. We publish implementations of the new algorithms in the widely-used Morfessor software package.
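To make the core idea concrete: under a unigram subword model, each word is segmented into the sequence of lexicon subwords that maximizes the product of their probabilities, which can be found with dynamic programming. The sketch below is an illustrative implementation under assumed inputs (a hand-built toy lexicon of subword log-probabilities), not the Morfessor EM+Prune implementation itself; the paper's actual training algorithms (EM re-estimation and lexicon pruning) operate on top of this kind of segmentation step.

```python
import math

def viterbi_segment(word, logprob):
    """Return the maximum-probability unigram segmentation of `word`.

    `logprob` maps each subword in the lexicon to its log probability.
    Single characters should be present so every word has at least
    one valid segmentation.
    """
    n = len(word)
    # best[i] = (best log-probability of word[:i], start of last subword)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logprob:
                score = best[start][0] + logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Follow back-pointers to recover the segmentation.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Toy lexicon (hypothetical values for illustration): multi-character
# subwords are far more probable than falling back to single characters.
lexicon = {"un": math.log(0.1), "seg": math.log(0.05),
           "ment": math.log(0.05), "ed": math.log(0.1)}
lexicon.update({c: math.log(0.001) for c in "unsegmented"})

print(viterbi_segment("unsegmented", lexicon))  # ['un', 'seg', 'ment', 'ed']
```

In an EM+Prune style training loop, the E-step collects expected subword counts over segmentations like this one, the M-step re-estimates the log-probabilities from those counts, and pruning removes low-utility subwords from the lexicon between iterations.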