Title
A Transfer Learning End-to-End Arabic Text-To-Speech (TTS) Deep Architecture
Authors
Abstract
Speech synthesis is the artificial production of human speech. A typical text-to-speech (TTS) system converts written text into a waveform. Many mature English TTS systems produce natural, human-like speech. In contrast, other languages, including Arabic, have received attention only recently. Existing Arabic speech synthesis solutions are slow and of low quality, and the speech they produce is less natural than that of English synthesizers. They also lack key speech characteristics such as intonation, stress, and rhythm. Earlier work addressed these issues with concatenative methods, such as unit selection, or with parametric methods, but these approaches require a great deal of laborious work and domain expertise. Another reason for the poor performance of Arabic speech synthesizers is the lack of speech corpora, whereas English has many publicly available corpora and audiobooks. This work describes how to generate high-quality, natural, human-like Arabic speech using an end-to-end deep neural network architecture. It uses only $\langle$text, audio$\rangle$ pairs, with a relatively small amount of recorded audio totaling 2.41 hours. It illustrates how to use English character embeddings even though the input consists of diacritized Arabic characters, and how to preprocess the audio samples to achieve the best results.