Unsupervised Learning of Disentangled Speech Content and Style Representation

Paper Title

Unsupervised Learning of Disentangled Speech Content and Style Representation

Authors

Andros Tjandra, Ruoming Pang, Yu Zhang, Shigeki Karita

Abstract

We present an approach for unsupervised learning of speech representations that disentangles content and style. Our model consists of: (1) a local encoder that captures per-frame information; (2) a global encoder that captures per-utterance information; and (3) a conditional decoder that reconstructs speech given local and global latent variables. Our experiments show that (1) the local latent variables encode speech content, as reconstructed speech can be recognized by ASR with low word error rates (WER), even with a different global encoding; (2) the global latent variables encode speaker style, as reconstructed speech shares speaker identity with the source utterance of the global encoding. Additionally, we demonstrate a useful application of our pre-trained model, where we can train a speaker recognition model from the global latent variables and achieve high accuracy by fine-tuning with as little data as one label per speaker.
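
The three components named in the abstract can be illustrated with a toy data-flow sketch. This is not the authors' architecture: the linear projections, the mean pooling over time, the dimensions, and every name below (`ToyDisentangler`, `encode_local`, `encode_global`, `decode`) are assumptions chosen for illustration, and the weights are random and untrained. The sketch only shows how a per-frame latent and a per-utterance latent might be combined in a conditional decoder, and how the global latent can be swapped between utterances.

```python
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows (untrained, illustrative)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyDisentangler:
    """Hypothetical stand-in for the local encoder, global encoder, and decoder."""

    def __init__(self, feat_dim=8, local_dim=4, global_dim=3, seed=0):
        rng = random.Random(seed)
        self.w_local = make_matrix(local_dim, feat_dim, rng)
        self.w_global = make_matrix(global_dim, feat_dim, rng)
        self.w_dec = make_matrix(feat_dim, local_dim + global_dim, rng)

    def encode_local(self, frames):
        # Local encoder: one latent vector per frame.
        return [matvec(self.w_local, frame) for frame in frames]

    def encode_global(self, frames):
        # Global encoder: average the frames over time, then project,
        # giving a single latent vector for the whole utterance.
        n = len(frames)
        mean_frame = [sum(column) / n for column in zip(*frames)]
        return matvec(self.w_global, mean_frame)

    def decode(self, local_latents, global_latent):
        # Conditional decoder: each frame is reconstructed from its local
        # latent concatenated with the shared global latent.
        return [matvec(self.w_dec, z + global_latent) for z in local_latents]


# Two random "utterances" with 8-dimensional frames and different lengths.
rng = random.Random(1)
utt_a = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
utt_b = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(7)]

model = ToyDisentangler()
# Per-frame latents from utterance A, per-utterance latent from utterance B.
converted = model.decode(model.encode_local(utt_a), model.encode_global(utt_b))
```

In this sketch `converted` has as many frames as `utt_a`, because the frame-level sequence comes from the local latents alone, while the single global vector from `utt_b` conditions every frame. With a trained model this swap is the setting the abstract's experiments describe: content follows the local latents, and speaker identity follows the source of the global latent.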
