LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

Paper Title

LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

Authors

Xu, Jin; Tan, Xu; Ren, Yi; Qin, Tao; Li, Jian; Zhao, Sheng; Liu, Tie-Yan

Abstract

Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR) are important speech tasks, and require a large amount of text and speech pairs for model training. However, there are more than 6,000 languages in the world and most languages lack speech training data, which poses significant challenges when building TTS and ASR systems for extremely low-resource languages. In this paper, we develop LRSpeech, a TTS and ASR system under the extremely low-resource setting, which can support rare languages with low data cost. LRSpeech consists of three key techniques: 1) pre-training on rich-resource languages and fine-tuning on low-resource languages; 2) dual transformation between TTS and ASR to iteratively boost the accuracy of each other; 3) knowledge distillation to customize the TTS model on a high-quality target-speaker voice and improve the ASR model on multiple voices. We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech. Experimental results show that LRSpeech 1) achieves high quality for TTS in terms of both intelligibility (more than 98% intelligibility rate) and naturalness (above 3.5 mean opinion score (MOS)) of the synthesized speech, which satisfy the requirements for industrial deployment, 2) achieves promising recognition accuracy for ASR, and 3) last but not least, uses extremely low-resource training data. We also conduct comprehensive analyses on LRSpeech with different amounts of data resources, and provide valuable insights and guidance for industrial deployment. We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS for more rare languages.
