Paper Title


DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization

Authors

Shaoshi Ling, Yuzong Liu

Abstract


Recent success in speech representation learning enables a new way to leverage unlabeled data to train speech recognition models. In speech representation learning, a large amount of unlabeled data is used in a self-supervised manner to learn a feature representation. A smaller amount of labeled data is then used to train a downstream ASR system on the new feature representations. Building on our previous work DeCoAR and drawing inspiration from other speech representation learning approaches, we propose DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization. We introduce several modifications to DeCoAR: first, we use Transformers in the encoding module instead of LSTMs; second, we introduce a vector quantization layer between the encoder and reconstruction modules; third, we propose an objective that combines the reconstruction loss with a vector quantization diversity loss to train speech representations. Our experiments show consistent improvements over other speech representations in different data-sparse scenarios. Without fine-tuning, a lightweight ASR model trained on 10 hours of LibriSpeech labeled data with DeCoAR 2.0 features outperforms the model trained on the full 960-hour dataset with filterbank features.
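To make the abstract's second and third modifications concrete, the following is a minimal numpy sketch of nearest-neighbor vector quantization combined with a codebook diversity term. This is not the paper's implementation: the helper names, the toy dimensions, the use of negative usage entropy as the diversity term, and the 0.1 loss weight are all illustrative assumptions.

```python
import numpy as np

def quantize(frames, codebook):
    """Snap each encoder frame to its nearest codebook entry (hypothetical helper).

    frames:   (T, D) array of encoder outputs
    codebook: (V, D) array of code vectors
    Returns the quantized frames and the chosen code indices.
    """
    # Squared Euclidean distance from every frame to every code vector: (T, V).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def diversity_loss(idx, num_codes):
    """Negative entropy of empirical code usage.

    Minimizing this pushes the model to use the whole codebook rather
    than collapsing onto a few entries.
    """
    probs = np.bincount(idx, minlength=num_codes) / len(idx)
    probs = np.clip(probs, 1e-12, None)  # avoid log(0) for unused codes
    return (probs * np.log(probs)).sum()

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 16))   # toy encoder outputs (T=100, D=16)
codebook = rng.normal(size=(8, 16))   # toy codebook with V=8 entries

quantized, idx = quantize(frames, codebook)
recon_loss = ((frames - quantized) ** 2).mean()            # stand-in reconstruction term
total_loss = recon_loss + 0.1 * diversity_loss(idx, 8)     # combined objective (weight assumed)
```

In the actual model the codebook is learned jointly with the Transformer encoder (typically via a differentiable relaxation such as Gumbel-softmax, since argmin is not differentiable); the sketch only shows how the two loss terms fit together.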
