Paper Title
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Paper Authors
Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
Paper Abstract
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
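The abstract's core mechanism, masking the speech input in latent space and solving a contrastive task against jointly learned quantized targets, can be illustrated with a short InfoNCE-style loss. The sketch below is a minimal PyTorch approximation, not the paper's implementation: the `contrastive_loss` function, its tensor shapes, the uniform negative sampling within the utterance, and all hyperparameter values are assumptions made for illustration.

```python
# Minimal sketch of a masked contrastive (InfoNCE-style) objective over
# quantized latents, loosely following the abstract's description.
# Shapes, sampling scheme, and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_loss(context, quantized, mask, num_negatives=10, temperature=0.1):
    """For each masked time step, the context vector must identify the true
    quantized latent among distractors drawn from other masked steps of the
    same utterance (uniform sampling here is an assumption).

    context:   (T, D) contextualized representations at each time step
    quantized: (T, D) quantized latent speech representations
    mask:      (T,)   boolean, True where the input was masked (needs >= 2 True)
    """
    idx = mask.nonzero(as_tuple=True)[0]              # masked positions
    c = F.normalize(context[idx], dim=-1)             # (M, D)
    q = F.normalize(quantized[idx], dim=-1)           # (M, D) positives
    M = idx.numel()

    # Sample negatives from the other masked positions, skipping the positive:
    # draw from [0, M-2], then shift indices >= own position up by one.
    neg_idx = torch.randint(0, M - 1, (M, num_negatives))
    neg_idx = neg_idx + (neg_idx >= torch.arange(M).unsqueeze(1)).long()
    negatives = q[neg_idx]                            # (M, K, D)

    # Cosine similarities: the positive first, then the K distractors.
    pos_sim = (c * q).sum(-1, keepdim=True)           # (M, 1)
    neg_sim = torch.einsum("md,mkd->mk", c, negatives)  # (M, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature

    # InfoNCE: the true quantized latent is always class 0.
    targets = torch.zeros(M, dtype=torch.long)
    return F.cross_entropy(logits, targets)


# Toy usage with random tensors standing in for model outputs.
T, D = 50, 256
context = torch.randn(T, D)
quantized = torch.randn(T, D)
mask = torch.rand(T) < 0.5
print(contrastive_loss(context, quantized, mask).item())
```

Note the design choice this loss encodes: because the targets are quantized, the model cannot trivially copy the input, and distractors from the same utterance force it to distinguish nearby speech sounds rather than speakers or recording conditions.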