Paper Title
Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency
Paper Authors
Paper Abstract
In recent years, generative adversarial network (GAN) based models have been successfully applied to unsupervised speech-to-speech conversion. The rich, compact harmonic view of the magnitude spectrogram is considered a suitable choice for training these models on audio data. To reconstruct the speech signal, a magnitude spectrogram is first generated by the neural network and then passed to methods such as the Griffin-Lim algorithm to reconstruct a phase spectrogram. This procedure has the problem that the generated magnitude spectrogram may not be consistent, which is required for finding a phase such that the full spectrogram yields a natural-sounding speech waveform. In this work, we address this problem by proposing a condition that encourages spectrogram consistency during the adversarial training procedure. We demonstrate our approach on the task of translating the voice of a male speaker to that of a female speaker, and vice versa. Our experimental results on the LibriSpeech corpus show that the model trained with the TF-consistency condition produces perceptually better speech-to-speech conversion.
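To make the notion of spectrogram consistency concrete, below is a minimal sketch of how one might check it with librosa's STFT and Griffin-Lim routines: estimate a phase for a generated magnitude, resynthesize a waveform, re-analyze it, and compare the resulting magnitude to the input. The function name `consistency_gap`, the synthetic test tone, and the relative-norm measure are illustrative assumptions, not the paper's exact training condition.

```python
import numpy as np
import librosa


def consistency_gap(mag, hop_length=256, n_iter=32):
    """Rough measure of how far a magnitude spectrogram is from consistency.

    A magnitude spectrogram is consistent if some time-domain signal has
    exactly this STFT magnitude. Griffin-Lim estimates a phase, the signal
    is resynthesized, and the magnitude of its STFT is compared to the input.
    """
    n_fft = 2 * (mag.shape[0] - 1)
    # Estimate a waveform whose STFT magnitude approximates `mag`.
    wav = librosa.griffinlim(mag, n_iter=n_iter, hop_length=hop_length)
    # Re-analyze the waveform and compare magnitudes.
    mag_hat = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length))
    # Relative Frobenius-norm gap: small when the spectrogram is (near-)consistent.
    return np.linalg.norm(mag_hat - mag) / (np.linalg.norm(mag) + 1e-8)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    # A synthetic two-tone signal stands in for a real speech recording.
    y = 0.5 * np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)
    real_mag = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
    print("gap for a magnitude taken from a real signal:",
          consistency_gap(real_mag))
    # Randomly rescaling the bins breaks consistency, so the gap grows.
    rng = np.random.default_rng(0)
    fake_mag = real_mag * rng.uniform(0.5, 1.5, size=real_mag.shape)
    print("gap for a randomly perturbed magnitude:      ",
          consistency_gap(fake_mag))
```

A consistency-encouraging term of this flavor can in principle be evaluated on generator outputs during adversarial training, penalizing magnitudes for which no phase produces a matching waveform; how the paper integrates such a condition into the GAN objective is not specified in the abstract.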