Paper Title


CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

Authors

Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman

Abstract


In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel approach to two-stage training. In Stage I, the model learns speaker-independent word-level prosody representations from speech which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby, enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody transfer. CC2 reduces the gap in naturalness between our baseline and copy-synthesised speech by $22.79\%$. In fine-grained prosody transfer evaluations, it obtains a relative improvement of $33.15\%$ in target speaker similarity.
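The abstract describes two inference paths through one network: fine-grained prosody transfer uses word-level prosody representations encoded from reference speech (learned in Stage I), while multi-speaker TTS predicts those same representations from text context (learned in Stage II), with a shared decoder conditioned on speaker identity. The toy sketch below illustrates only this routing idea with random linear maps; all dimensions, module shapes, and function names are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions; the abstract does not specify these).
N_WORDS, PROSODY_DIM, TEXT_DIM, MEL_DIM, SPK_DIM = 4, 8, 16, 80, 4

# Stage-I stand-in: a "prosody encoder" mapping per-word acoustic
# features to speaker-independent word-level prosody representations.
W_enc = rng.normal(size=(MEL_DIM, PROSODY_DIM))
def encode_prosody(word_acoustics):          # (N_WORDS, MEL_DIM)
    return word_acoustics @ W_enc            # (N_WORDS, PROSODY_DIM)

# Stage-II stand-in: a "prosody predictor" mapping contextual text
# features into the same representation space.
W_pred = rng.normal(size=(TEXT_DIM, PROSODY_DIM))
def predict_prosody(word_text_context):      # (N_WORDS, TEXT_DIM)
    return word_text_context @ W_pred        # (N_WORDS, PROSODY_DIM)

# Shared decoder stand-in: conditions on prosody plus a speaker embedding.
W_dec = rng.normal(size=(PROSODY_DIM + SPK_DIM, MEL_DIM))
def decode(prosody, speaker_embedding):      # -> (N_WORDS, MEL_DIM)
    spk = np.tile(speaker_embedding, (prosody.shape[0], 1))
    return np.concatenate([prosody, spk], axis=1) @ W_dec

speakers = {"A": rng.normal(size=SPK_DIM), "B": rng.normal(size=SPK_DIM)}

# Many-to-many transfer path: encode prosody from speaker A's speech,
# decode with speaker B's identity (activates the Stage-I encoder).
src_audio = rng.normal(size=(N_WORDS, MEL_DIM))
transfer_out = decode(encode_prosody(src_audio), speakers["B"])

# Multi-speaker TTS path: predict prosody from text context instead
# (activates the Stage-II predictor), decode with any seen speaker.
text_ctx = rng.normal(size=(N_WORDS, TEXT_DIM))
tts_out = decode(predict_prosody(text_ctx), speakers["A"])

print(transfer_out.shape, tts_out.shape)  # (4, 80) (4, 80)
```

Both paths share the decoder and representation space, which is what lets a single model serve TTS and prosody transfer by activating different parts of the network.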
