Paper Title


CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

Authors

Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman

Abstract


In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel approach to two-stage training. In Stage I, the model learns speaker-independent word-level prosody representations from speech which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby, enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody transfer. CC2 reduces the gap in naturalness between our baseline and copy-synthesised speech by $22.79\%$. In fine-grained prosody transfer evaluations, it obtains a relative improvement of $33.15\%$ in target speaker similarity.
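The abstract describes two inference paths through one network: fine-grained prosody transfer uses word-level prosody representations encoded from reference speech (learned in Stage I), while multi-speaker TTS predicts those same representations from text context (learned in Stage II), with a shared decoder conditioned on speaker identity. The toy sketch below illustrates only this routing idea with random linear maps; all dimensions, module shapes, and function names are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions; the abstract does not specify these).
N_WORDS, PROSODY_DIM, TEXT_DIM, MEL_DIM, SPK_DIM = 4, 8, 16, 80, 4

# Stage-I stand-in: a "prosody encoder" mapping per-word acoustic
# features to speaker-independent word-level prosody representations.
W_enc = rng.normal(size=(MEL_DIM, PROSODY_DIM))
def encode_prosody(word_acoustics):          # (N_WORDS, MEL_DIM)
    return word_acoustics @ W_enc            # (N_WORDS, PROSODY_DIM)

# Stage-II stand-in: a "prosody predictor" mapping contextual text
# features into the same representation space.
W_pred = rng.normal(size=(TEXT_DIM, PROSODY_DIM))
def predict_prosody(word_text_context):      # (N_WORDS, TEXT_DIM)
    return word_text_context @ W_pred        # (N_WORDS, PROSODY_DIM)

# Shared decoder stand-in: conditions on prosody plus a speaker embedding.
W_dec = rng.normal(size=(PROSODY_DIM + SPK_DIM, MEL_DIM))
def decode(prosody, speaker_embedding):      # -> (N_WORDS, MEL_DIM)
    spk = np.tile(speaker_embedding, (prosody.shape[0], 1))
    return np.concatenate([prosody, spk], axis=1) @ W_dec

speakers = {"A": rng.normal(size=SPK_DIM), "B": rng.normal(size=SPK_DIM)}

# Many-to-many transfer path: encode prosody from speaker A's speech,
# decode with speaker B's identity (activates the Stage-I encoder).
src_audio = rng.normal(size=(N_WORDS, MEL_DIM))
transfer_out = decode(encode_prosody(src_audio), speakers["B"])

# Multi-speaker TTS path: predict prosody from text context instead
# (activates the Stage-II predictor), decode with any seen speaker.
text_ctx = rng.normal(size=(N_WORDS, TEXT_DIM))
tts_out = decode(predict_prosody(text_ctx), speakers["A"])

print(transfer_out.shape, tts_out.shape)  # (4, 80) (4, 80)
```

Both paths share the decoder and representation space, which is what lets a single model serve TTS and prosody transfer by activating different parts of the network.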
