Paper Title

Knowledge-driven Subword Grammar Modeling for Automatic Speech Recognition in Tamil and Kannada

Authors

Madhavaraj A, Bharathi Pilar, Ramakrishnan A G

Abstract

In this paper, we present specially designed automatic speech recognition (ASR) systems for Tamil and Kannada, two highly agglutinative and inflective languages, that can recognize an unlimited vocabulary of words. We use subwords as the basic lexical units for recognition and construct subword grammar weighted finite state transducer (SG-WFST) graphs for word segmentation that capture most of the complex word formation rules of the languages. We have identified the following categories of words: (i) verbs, (ii) nouns, (iii) pronouns, and (iv) numbers. Prefix, infix, and suffix lists of subwords are created for each of these categories and are used to design the SG-WFST graphs. We also present a heuristic segmentation algorithm that can segment even exceptional words that do not follow the rules encapsulated in the SG-WFST graph. Most data-driven subword dictionary creation algorithms are computation driven and hence do not guarantee morpheme-like units, so we have used linguistic knowledge of the languages to manually create the subword dictionaries and the graphs. Finally, we train a deep neural network acoustic model and combine it with the pronunciation lexicon of the subword dictionary and the SG-WFST graph to build the subword-ASR systems. Since the subword-ASR produces subword sequences as output for a given test utterance, we post-process its output to get the final word sequence, so that the actual number of words that can be recognized is much higher. In experiments with the subword-ASR systems on the IISc-MILE Tamil and Kannada ASR corpora, we observe absolute word error rate reductions of 12.39% and 13.56% over the baseline word-based ASR systems for Tamil and Kannada, respectively.
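The segmentation and post-processing steps described in the abstract can be illustrated with a small toy sketch. It is not the authors' SG-WFST implementation: the subword lists are hypothetical romanized strings loosely modeled on Tamil verb forms, the grammar is assumed to be one prefix, any number of infixes, and an optional final suffix, the fallback is a simple greedy longest-match stand-in for the paper's heuristic algorithm, and the `+` continuation marker is an assumed convention for rejoining subwords.

```python
from typing import List, Optional, Set

# Hypothetical toy subword lists (assumption; not taken from the paper).
PREFIXES: Set[str] = {"padi", "ezhudu"}
INFIXES: Set[str] = {"kkir", "kir"}
SUFFIXES: Set[str] = {"aan", "aal", "een"}

MARK = "+"  # assumed marker: a subword ending in "+" attaches to the next one


def grammar_segment(word: str) -> Optional[List[str]]:
    """Segment `word` as prefix + infix* + optional suffix, or return None."""

    def rest(pos: int) -> Optional[List[str]]:
        if pos == len(word):
            return []
        for end in range(len(word), pos, -1):  # try the longest piece first
            piece = word[pos:end]
            if piece in SUFFIXES and end == len(word):
                return [piece]
            if piece in INFIXES:
                tail = rest(end)
                if tail is not None:
                    return [piece] + tail
        return None

    for end in range(len(word), 0, -1):
        if word[:end] in PREFIXES:
            tail = rest(end)
            if tail is not None:
                return [word[:end]] + tail
    return None


def fallback_segment(word: str) -> List[str]:
    """Greedy longest-match over all lists; unknown characters become units."""
    units = PREFIXES | INFIXES | SUFFIXES
    pieces, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in units:
                break
        else:
            end = pos + 1  # no known subword starts here: emit one character
        pieces.append(word[pos:end])
        pos = end
    return pieces


def segment(word: str) -> List[str]:
    """Segment a word and mark every non-final subword with MARK."""
    if not word:
        return []
    pieces = grammar_segment(word) or fallback_segment(word)
    return [p + MARK for p in pieces[:-1]] + [pieces[-1]]


def join_subwords(tokens: List[str]) -> List[str]:
    """Post-process a recognized subword sequence back into words."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith(MARK):
            current += tok[: -len(MARK)]
        else:
            words.append(current + tok)
            current = ""
    if current:  # dangling marked subword at the end of the sequence
        words.append(current)
    return words


if __name__ == "__main__":
    subwords = segment("padikkiraan") + segment("ezhudukiraal")
    print(subwords)
    print(join_subwords(subwords))
```

Because the recognizer's vocabulary in this sketch is the set of subwords rather than whole words, any word that can be composed from the lists can in principle be reconstructed by `join_subwords`, which mirrors the abstract's point that the number of recognizable words is much larger than the subword dictionary itself.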
