Paper Title

Disentangling Style and Speaker Attributes for TTS Style Transfer

Paper Authors

Xiaochun An, Frank K. Soong, Lei Xie

Paper Abstract


End-to-end neural TTS has shown improved performance in speech style transfer. However, the improvement is still limited by the available training data in both target styles and speakers. Additionally, degraded performance is observed when the trained TTS tries to transfer speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to seen and unseen style transfer training on disjoint, multi-style datasets, i.e., datasets in which each individual style is recorded by one speaker in multiple utterances. An inverse autoregressive flow (IAF) technique is first introduced to improve the variational inference for learning an expressive style representation. A speaker encoder network is then developed for learning a discriminative speaker embedding, which is jointly trained with the rest of the neural TTS modules. The proposed approach to seen and unseen style transfer is effectively trained with six specifically designed objectives: reconstruction loss, adversarial loss, style distortion loss, cycle consistency loss, style classification loss, and speaker classification loss. Experiments demonstrate, both objectively and subjectively, the effectiveness of the proposed approach for seen and unseen style transfer tasks. The performance of our approach is superior to and more robust than those of four other reference systems of prior art.
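The abstract's inverse autoregressive flow (IAF) step can be sketched as follows. This is a minimal numpy illustration of the general IAF idea (Kingma et al.), not the paper's actual model: a sample from a diagonal-Gaussian posterior is passed through affine autoregressive transforms whose triangular Jacobian keeps the log-density update cheap. The dimensionality, masked linear parameterization, and number of steps are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # latent dimensionality (illustrative, not from the paper)

# One IAF step: an affine autoregressive transform. (mu_i, s_i) may depend
# only on z_{<i}; a strictly lower-triangular weight mask enforces this.
mask = np.tril(np.ones((D, D)), k=-1)
W_mu = rng.normal(scale=0.1, size=(D, D)) * mask
W_s = rng.normal(scale=0.1, size=(D, D)) * mask


def iaf_step(z, logq):
    """Transform z and update log q(z) for one IAF step."""
    mu = z @ W_mu.T
    s = z @ W_s.T                      # pre-activation scale
    sigma = 1.0 / (1.0 + np.exp(-s))   # sigmoid gate in (0, 1)
    z_new = sigma * z + (1.0 - sigma) * mu
    # The Jacobian is triangular, so log|det| = sum_i log sigma_i.
    logq_new = logq - np.log(sigma).sum(axis=-1)
    return z_new, logq_new


# Start from samples of the base diagonal-Gaussian posterior q(z0|x).
z0 = rng.normal(size=(2, D))
logq0 = -0.5 * (z0 ** 2 + np.log(2 * np.pi)).sum(axis=-1)

z, logq = z0, logq0
for _ in range(3):  # chain several steps for a more expressive posterior
    z, logq = iaf_step(z, logq)
```

In a real model each step would have its own learned autoregressive network (e.g. a MADE-style masked MLP conditioned on the encoder output); here one shared masked linear layer is reused purely to keep the sketch short.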
