Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction

Paper title

Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction

Authors

Yi Zhao, Haoyu Li, Cheng-I Lai, Jennifer Williams, Erica Cooper, Junichi Yamagishi

Abstract

Vector Quantized Variational AutoEncoders (VQ-VAE) are a powerful representation learning framework that can discover discrete groups of features from a speech signal without supervision. Until now, the VQ-VAE architecture has modeled only a single type of speech feature at a time, such as only phones or only F0. This paper introduces an important extension to VQ-VAE for learning F0-related suprasegmental information simultaneously with traditional phone features. The proposed framework uses two encoders, so that the F0 trajectory and the speech waveform are both inputs to the system, and two separate codebooks are learned. We used a WaveRNN vocoder as the decoder component of the VQ-VAE. Our speaker-independent VQ-VAE was trained with raw speech waveforms from multi-speaker Japanese speech databases. Experimental results show that the proposed extension reduces F0 distortion of reconstructed speech for all unseen test speakers and yields significantly higher preference scores in a listening test. We additionally conducted experiments using single-speaker Mandarin speech to demonstrate the advantages of our architecture in another language that relies heavily on F0.
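
As a rough illustration of the two-codebook idea described in the abstract, the following NumPy sketch quantizes two latent streams against two separate codebooks by nearest-neighbor lookup. It is only a sketch under stated assumptions, not the authors' implementation: the codebook sizes, latent dimensions, shared frame rate, and the concatenation used to form the decoder conditioning are all hypothetical, and random arrays stand in for the outputs of the waveform encoder and the F0 encoder.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent frame to its nearest codebook vector (squared Euclidean distance).

    latents:  array of shape (T, D), one latent vector per frame
    codebook: array of shape (K, D), K learned code vectors
    Returns the chosen code indices (T,) and the quantized latents (T, D).
    """
    # Pairwise squared distances between frames and codes, shape (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


rng = np.random.default_rng(0)

# Hypothetical sizes; the abstract does not state them.
phone_codebook = rng.normal(size=(256, 64))  # codes for phone-like content
f0_codebook = rng.normal(size=(32, 16))      # codes for F0-related information

# Random stand-ins for the two encoder outputs (assumed to share a frame rate).
phone_latents = rng.normal(size=(100, 64))   # from the waveform encoder
f0_latents = rng.normal(size=(100, 16))      # from the F0 encoder

phone_idx, phone_q = quantize(phone_latents, phone_codebook)
f0_idx, f0_q = quantize(f0_latents, f0_codebook)

# Assumption: the decoder is conditioned on both quantized streams, joined here
# by simple concatenation along the feature axis.
decoder_conditioning = np.concatenate([phone_q, f0_q], axis=-1)
```

In a trained system the codebooks would be learned jointly with the encoders, and a vocoder such as WaveRNN would consume the conditioning sequence; neither step is shown here.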
