Paper Title
Learning Speaker Embedding from Text-to-Speech
Paper Authors
Abstract
Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker voices given an input text and the corresponding speaker embedding. In this work, we investigate the effectiveness of the TTS reconstruction objective for improving representation learning for speaker verification. We jointly train end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion. We hypothesize that the embeddings will contain minimal phonetic information, since the TTS decoder obtains that information from the textual input. TTS reconstruction can also be combined with speaker classification to further enhance these embeddings. Once trained, the speaker encoder computes representations for the speaker verification task, while the rest of the TTS blocks are discarded. We investigated training TTS from either manual or ASR-generated transcripts; the latter allows us to train embeddings on datasets without manual transcripts. We also compared ASR transcripts and Kaldi phone alignments as TTS inputs, finding that the latter performed better due to their finer temporal resolution. Unsupervised TTS embeddings improved EER by 2.06\% absolute relative to i-vectors on the LibriTTS dataset. TTS combined with a speaker classification loss improved EER by 0.28\% and 0.73\% absolute over a model using only the speaker classification loss on LibriTTS and Voxceleb1, respectively.
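The following is a minimal sketch (not the paper's released code) of the joint objective the abstract describes: a speaker encoder produces an embedding from the target mel spectrogram, a Tacotron 2-style decoder reconstructs the spectrogram from text conditioned on that embedding, and an optional speaker classification head is added. The module names, shapes, and loss weighting `alpha` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointTTSSpeakerModel(nn.Module):
    """Hypothetical joint model: TTS reconstruction + optional speaker classification.

    `speaker_encoder` and `tts_decoder` stand in for the paper's speaker
    embedding network and Tacotron 2 synthesizer; their interfaces here
    are assumptions for illustration.
    """

    def __init__(self, speaker_encoder, tts_decoder, num_speakers, emb_dim=256):
        super().__init__()
        self.speaker_encoder = speaker_encoder  # mel spectrogram -> fixed-size embedding
        self.tts_decoder = tts_decoder          # (text tokens, embedding) -> mel spectrogram
        self.classifier = nn.Linear(emb_dim, num_speakers)  # optional speaker ID head

    def forward(self, text, mel, speaker_id=None, alpha=1.0):
        emb = self.speaker_encoder(mel)          # speaker embedding from reference audio
        mel_pred = self.tts_decoder(text, emb)   # reconstruct mel from text + embedding
        loss = F.mse_loss(mel_pred, mel)         # TTS reconstruction objective
        if speaker_id is not None:
            # Optionally add the speaker classification loss described in the abstract.
            loss = loss + alpha * F.cross_entropy(self.classifier(emb), speaker_id)
        return loss, emb
```

At verification time, only `speaker_encoder` would be kept and its embeddings scored (e.g., by cosine similarity or PLDA) for speaker verification trials; the decoder and classifier are discarded, as the abstract states.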