Paper Title


Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech

Authors

Lehečka, Jan, Švec, Jan, Pražák, Aleš, Psutka, Josef V.

Abstract


In this paper, we present our progress in pretraining Czech monolingual audio transformers from a large dataset containing more than 80 thousand hours of unlabeled speech, and subsequently fine-tuning the model on automatic speech recognition tasks using a combination of in-domain data and almost 6 thousand hours of out-of-domain transcribed speech. We are presenting a large palette of experiments with various fine-tuning setups evaluated on two public datasets (CommonVoice and VoxPopuli) and one extremely challenging dataset from the MALACH project. Our results show that monolingual Wav2Vec 2.0 models are robust ASR systems, which can take advantage of large labeled and unlabeled datasets and successfully compete with state-of-the-art LVCSR systems. Moreover, Wav2Vec models proved to be good zero-shot learners when no training data are available for the target ASR task.
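The fine-tuned Wav2Vec 2.0 models described above produce frame-level character logits that are trained with a CTC objective. As a rough illustration (not code from the paper), the simplest way to turn such per-frame predictions into text is CTC greedy decoding: collapse consecutive repeats, then drop the blank token. The toy vocabulary and frame sequence below are invented for the example.

```python
# Minimal sketch of CTC greedy decoding, the step that turns frame-level
# argmax ids from a CTC-trained ASR model (e.g. Wav2Vec 2.0) into text.
# The vocabulary and frame ids here are illustrative, not from the paper.

def ctc_greedy_decode(frame_ids, blank_id=0, id_to_char=None):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    out = []
    prev = None
    for i in frame_ids:
        # Keep a symbol only when it differs from the previous frame
        # (collapse repeats) and is not the blank token.
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    if id_to_char is None:
        return out
    return "".join(id_to_char[i] for i in out)

# Toy vocabulary: id 0 is the CTC blank.
vocab = {0: "<blank>", 1: "a", 2: "h", 3: "o"}
# Per-frame argmax ids for a short utterance.
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frames, blank_id=0, id_to_char=vocab))  # → aho
```

A blank between two identical ids keeps both (e.g. frames `[1, 0, 1]` decode to "aa"), which is how CTC represents genuinely doubled characters.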
