Paper Title

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Authors

Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye

Abstract

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional effort from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations. The S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses demonstrate that each design in Dict-TTS is effective. The code is available at https://github.com/Zain-Jiang/Dict-TTS.
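
The idea of matching an input's semantics against dictionary entries can be illustrated with a toy attention step. The sketch below is not the authors' implementation: the function name `attend_pronunciation`, the three-dimensional vectors, and the pinyin labels are all hypothetical, chosen only to show how scaled dot-product attention over dictionary glosses could favour one reading of a polyphonic character. The abstract describes a module trained end to end, which would presumably use the soft weights directly; the sketch takes the top-weighted entry only to make the result readable.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend_pronunciation(query, entries):
    """Score each dictionary entry against the context query.

    query:   hypothetical contextual vector for one character
    entries: list of (gloss_vector, pronunciation) pairs for that character
    Returns the top-weighted pronunciation and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, gloss) / scale for gloss, _ in entries]
    weights = softmax(scores)
    best = max(range(len(entries)), key=lambda i: weights[i])
    return entries[best][1], weights

# Toy dictionary for the polyphonic character 行 (vectors are made up):
# "xing2" (to walk) versus "hang2" (row, as in 银行 "bank").
ENTRIES = [
    ([1.0, 0.0, 0.2], "xing2"),
    ([0.0, 1.0, 0.1], "hang2"),
]

# Assumed context vector for 行 as it appears in 银行.
context_query = [0.1, 0.9, 0.0]
pron, weights = attend_pronunciation(context_query, ENTRIES)
print(pron, [round(w, 3) for w in weights])
```

Because the weights come from a softmax, they sum to one and remain differentiable with respect to the query, which is the property that would let such a module be trained without annotated phoneme labels.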
