Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data

Paper Title

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data

Authors

Zhang, Mingyang; Zhou, Yi; Zhao, Li; Li, Haizhou

Abstract

This paper presents a novel framework, called TTS-VC transfer learning, for building a voice conversion (VC) system by learning from a text-to-speech (TTS) synthesis system. We first develop a multi-speaker speech synthesis system with a sequence-to-sequence encoder-decoder architecture, where the encoder extracts robust linguistic representations of text, and the decoder, conditioned on a target speaker embedding, takes the context vectors and the attention recurrent network cell output to generate target acoustic features. We take advantage of the fact that the TTS system maps input text to speaker-independent context vectors, and reuse this mapping to supervise the training of the latent representations of an encoder-decoder voice conversion system. In the voice conversion system, the encoder takes speech instead of text as input, while the decoder is functionally similar to the TTS decoder. Because we condition the decoder on a speaker embedding, the system can be trained on non-parallel data for any-to-any voice conversion. During voice conversion training, we present text to the speech synthesis network and speech to the voice conversion network. At run-time, the voice conversion network uses its own encoder-decoder architecture. Experiments show that the proposed approach consistently outperforms two competitive voice conversion baselines, namely phonetic posteriorgram and variational autoencoder methods, in terms of speech quality, naturalness, and speaker similarity.
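
The supervision idea in the abstract, where the TTS context vectors guide the latent representations of the VC encoder, might be sketched as a combined training objective. This is only an illustration under stated assumptions: the abstract does not specify the loss, so the mean-absolute-error form, the `context_weight` parameter, the function names, and the premise that both context sequences are already time-aligned and equally shaped are all hypothetical.

```python
from typing import Sequence

Frames = Sequence[Sequence[float]]


def mean_abs_error(a: Frames, b: Frames) -> float:
    """Mean absolute difference between two equally shaped frame sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same number of frames")
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same dimensionality")
        for x, y in zip(frame_a, frame_b):
            total += abs(x - y)
            count += 1
    if count == 0:
        raise ValueError("sequences must not be empty")
    return total / count


def vc_training_loss(
    vc_context: Frames,      # latent vectors from the VC (speech) encoder
    tts_context: Frames,     # context vectors from the TTS system, used as targets
    predicted_mel: Frames,   # acoustic features produced by the VC decoder
    target_mel: Frames,      # ground-truth acoustic features
    context_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Reconstruction loss plus a term pulling VC latents toward TTS context vectors."""
    context_loss = mean_abs_error(vc_context, tts_context)
    reconstruction_loss = mean_abs_error(predicted_mel, target_mel)
    return reconstruction_loss + context_weight * context_loss
```

In this sketch the context term is what transfers knowledge from TTS to VC: because the TTS context vectors are described as speaker independent, matching them would encourage the speech encoder to discard speaker identity, leaving the speaker embedding to control the output voice.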
