Paper Title

Exploring wav2vec 2.0 on speaker verification and language identification

Authors

Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu

Abstract

Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning. It follows a two-stage training process of pre-training and fine-tuning, and performs well in speech recognition tasks, especially in ultra-low-resource cases. In this work, we attempt to extend the self-supervised framework to speaker verification and language identification. First, we use some preliminary experiments to show that wav2vec 2.0 can capture information about the speaker and language. Then we demonstrate the effectiveness of wav2vec 2.0 on the two tasks respectively. For speaker verification, we obtain a new state-of-the-art result, an Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset. For language identification, we obtain an EER of 12.02% on the 1-second condition and an EER of 3.47% on the full-length condition of the AP17-OLR dataset. Finally, we achieve unified modeling of the two tasks with a single model through multi-task learning.
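The abstract reports results as Equal Error Rates. As a minimal illustrative sketch (not code from the paper), the EER is the operating point where the false-accept rate on impostor trials equals the false-reject rate on genuine trials; given per-trial similarity scores, it can be estimated by sweeping a decision threshold:

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Estimate the Equal Error Rate from verification trial scores.

    genuine_scores:  scores of same-speaker (target) trials, higher = more similar
    impostor_scores: scores of different-speaker (non-target) trials
    """
    # Sweep every observed score as a candidate accept threshold.
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    # False-accept rate: fraction of impostor trials scoring at or above the threshold.
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    # False-reject rate: fraction of genuine trials scoring below the threshold.
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    # EER is where the two rates cross; take the threshold minimizing their gap.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Toy example with perfectly separated scores: EER is 0.
print(compute_eer(np.array([0.9, 0.8, 0.7]), np.array([0.1, 0.2, 0.3])))
```

In practice the scores would come from, e.g., cosine similarity between utterance-level embeddings pooled from wav2vec 2.0 features; toolkit implementations typically interpolate the FAR/FRR curves rather than picking the nearest threshold as this sketch does.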
