Data Efficient Voice Cloning from Noisy Samples with Domain Adversarial Training

Paper Title

Data Efficient Voice Cloning from Noisy Samples with Domain Adversarial Training

Authors

Jian Cong, Shan Yang, Lei Xie, Guoqiao Yu, Guanglu Wan

Abstract

Data-efficient voice cloning aims to synthesize a target speaker's voice from only a few enrollment samples. Speaker adaptation and speaker encoding are the two typical methods, both built on a base model trained on multiple speakers. The former uses a small set of target-speaker data to transfer the multi-speaker model to the target speaker's voice through direct model updates. In the latter, only a few seconds of the target speaker's audio pass through an extra speaker encoding model, which works alongside the multi-speaker model to synthesize the target speaker's voice without any model update. Both methods, however, need clean target-speaker data, whereas samples provided by users in real applications may inevitably contain acoustic noise, and generating the target voice from noisy data remains challenging. In this paper, we study data-efficient voice cloning from noisy samples under the sequence-to-sequence TTS paradigm. Specifically, we introduce domain adversarial training (DAT) into speaker adaptation and speaker encoding, with the aim of disentangling noise from the speech-noise mixture. Experiments show that, for both speaker adaptation and speaker encoding, the proposed approaches consistently synthesize clean speech from noisy speaker samples and clearly outperform a method that adopts a state-of-the-art speech enhancement module.
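The abstract does not say how DAT is implemented. A common formulation of domain adversarial training attaches a domain classifier (here, clean versus noisy) to an encoder through a gradient reversal step, so that the classifier learns to detect noise while the encoder is pushed to hide it. The toy sketch below illustrates that idea on a scalar model; the function name `dat_step`, the scalar weights, and the parameters `lam` and `lr` are all hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Logistic function used by the toy domain classifier."""
    return 1.0 / (1.0 + math.exp(-z))


def dat_step(w_enc, w_dom, x, domain_label, lam=1.0, lr=0.1):
    """One gradient step on a scalar toy model (illustrative assumption only).

    Encoder:           h = w_enc * x
    Domain classifier: p = sigmoid(w_dom * h), predicting 1 = noisy, 0 = clean
    Loss:              binary cross-entropy between p and domain_label
    """
    h = w_enc * x
    p = sigmoid(w_dom * h)

    # Gradient of the cross-entropy loss with respect to the classifier logit.
    d_logit = p - domain_label

    # The domain classifier descends the loss as usual.
    grad_w_dom = d_logit * h

    # Gradient flowing back into the encoder output.
    grad_h = d_logit * w_dom

    # Gradient reversal: the encoder receives the negated, scaled gradient,
    # so it moves in the direction that makes the domains harder to separate.
    grad_w_enc = -lam * grad_h * x

    return w_enc - lr * grad_w_enc, w_dom - lr * grad_w_dom


# Usage: one update on a single "noisy" example.
new_w_enc, new_w_dom = dat_step(w_enc=1.0, w_dom=1.0, x=1.0, domain_label=1)
```

In this example the classifier weight increases, improving noise detection, while the encoder weight decreases, making the encoded feature less informative about noise. Setting `lam=0` disables the adversarial signal and leaves the encoder unchanged.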
