Title
Distilling a Pretrained Language Model to a Multilingual ASR Model
Authors
Abstract
Multilingual speech data often suffers from a long-tailed language distribution, resulting in performance degradation. However, multilingual text data is much easier to obtain, yielding more useful general language models. Hence, we are motivated to distill the rich knowledge embedded in a well-trained teacher text model into a student speech model. We propose a novel method, Distilling a Language model to a Speech model (Distill-L2S), which aligns the latent representations of the two different modalities. Their subtle differences are handled by a shrinking mechanism, nearest-neighbor interpolation, and a learnable linear projection layer. We demonstrate the effectiveness of our distillation method by applying it to the multilingual automatic speech recognition (ASR) task. We distill a transformer-based cross-lingual language model (InfoXLM) while fine-tuning the large-scale multilingual ASR model (XLSR-wav2vec 2.0) for each language. We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset, each with less than 100 hours of speech data.
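The cross-modal alignment described in the abstract can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the function names, the mean-squared distance, and the use of plain matrices instead of trained transformer states are all illustrative assumptions. It only shows the two length/dimension-matching pieces the abstract names, nearest-neighbor interpolation along the time axis and a learnable linear projection, applied before comparing speech and text representations (the shrinking mechanism, which first removes redundant speech frames, is omitted here).

```python
import numpy as np

def nearest_neighbor_interpolate(seq, target_len):
    """Resample a (T, d) sequence to (target_len, d) by picking, for each
    target position, the nearest source frame along the time axis."""
    src_len = seq.shape[0]
    idx = np.round(np.linspace(0, src_len - 1, target_len)).astype(int)
    return seq[idx]

def distill_alignment_loss(speech_hidden, text_hidden, projection):
    """Illustrative distillation objective: project speech states
    (T_s, d_speech) into the text space with a learnable matrix
    (d_speech, d_text), length-match them to the text states (T_t, d_text)
    by nearest-neighbor interpolation, and return a mean-squared distance."""
    projected = speech_hidden @ projection                     # (T_s, d_text)
    matched = nearest_neighbor_interpolate(projected, text_hidden.shape[0])
    return float(np.mean((matched - text_hidden) ** 2))

# Toy shapes: 50 speech frames vs. 10 text tokens, both 8-dimensional.
rng = np.random.default_rng(0)
speech = rng.standard_normal((50, 8))
proj = np.eye(8)                       # stands in for the learned projection
text = nearest_neighbor_interpolate(speech, 10)
loss = distill_alignment_loss(speech, text, proj)
```

In training, `proj` would be a learned parameter updated together with the student ASR model, so gradients from this alignment loss flow into the speech encoder while the teacher text states stay fixed.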