Unsupervised word-level prosody tagging for controllable speech synthesis

Paper title

Unsupervised word-level prosody tagging for controllable speech synthesis

Authors

Guo, Yiwei; Du, Chenpeng; Yu, Kai

Abstract

Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel two-stage approach for unsupervised word-level prosody tagging: we first group the words into different types with a decision tree according to their phonetic content, and then cluster the prosodies within each word type separately using a Gaussian mixture model (GMM). This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system with the derived word-level prosody tags is trained for controllable speech synthesis. Experiments on LJSpeech show that the TTS model trained with word-level prosody tags not only achieves better naturalness than a typical FastSpeech2 model, but also gains the ability to manipulate word-level prosody.
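
The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the shape of "group words by type, then cluster prosody within each type". Everything named here is an assumption made for illustration: `word_type` is a single hand-written split on phoneme count standing in for the decision tree, each word's prosody is reduced to one hypothetical scalar feature (for example, a mean pitch value), `fit_gmm_1d` is a plain EM fit of a one-dimensional Gaussian mixture, and the `type_index` tag format is invented.

```python
import math
from collections import defaultdict


def word_type(phonemes):
    # Stage 1 stand-in: one hypothetical split on phoneme count, used here
    # in place of a decision tree over phonetic content.
    return "long" if len(phonemes) > 3 else "short"


def gauss(x, mean, var):
    # Density of a 1-D normal distribution.
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, k, iters=50):
    # Plain EM for a k-component 1-D Gaussian mixture (illustrative only).
    n = len(xs)
    k = min(k, n)
    ordered = sorted(xs)
    # Deterministic initialisation: means at evenly spaced quantiles.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    centre = sum(xs) / n
    spread = max(sum((x - centre) ** 2 for x in xs) / n, 1e-6)
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)
    return weights, means, variances


def predict(x, model):
    # Index of the mixture component with the highest posterior for x.
    weights, means, variances = model
    scores = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
    return scores.index(max(scores))


def tag_words(words, k=2):
    # words: list of (phoneme_list, prosody_feature) pairs.
    groups = defaultdict(list)
    for phonemes, feature in words:
        groups[word_type(phonemes)].append(feature)
    # Stage 2: a separate mixture, and so a separate tag set, per word type.
    models = {t: fit_gmm_1d(xs, k) for t, xs in groups.items()}
    return [f"{word_type(ph)}_{predict(f, models[word_type(ph)])}" for ph, f in words]
```

Calling `tag_words` on a list of `(phonemes, feature)` pairs returns one tag per word, with tags drawn from a label set specific to that word's type. A real system would presumably cluster richer prosody representations than a single scalar and learn the grouping rather than hard-code it.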
