End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

Paper Title

End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

Authors

Xuankai Chang, Takashi Maekaku, Yuya Fujita, Shinji Watanabe

Abstract

This work presents our end-to-end (E2E) automatic speech recognition (ASR) model targeting robust speech recognition, called Integrated speech Recognition with enhanced speech Input for Self-supervised learning representation (IRIS). Compared with conventional E2E ASR models, the proposed E2E model integrates two important modules: a speech enhancement (SE) module and a self-supervised learning representation (SSLR) module. The SE module enhances the noisy speech. The SSLR module then extracts features from the enhanced speech, and these features are used for speech recognition. To train the proposed model, we establish an efficient learning scheme. Evaluation results on the monaural CHiME-4 task show that the IRIS model achieves the best performance reported in the literature for the single-channel CHiME-4 benchmark (word error rates of 2.0% on the real development set and 3.9% on the real test set), thanks to the powerful pre-trained SSLR module and the fine-tuned SE module.
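The data flow described in the abstract (noisy speech, then SE, then SSLR features, then recognition) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in chosen for illustration: `toy_enhance`, `toy_sslr`, `toy_asr`, and `Cascade` are not the paper's modules, and the noise gate, frame statistics, and energy threshold are toy assumptions that only show how the three stages are chained.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Waveform = List[float]
Features = List[List[float]]


def toy_enhance(noisy: Sequence[float], floor: float = 0.05) -> Waveform:
    """Hypothetical SE stand-in: zero out samples with magnitude below `floor`."""
    return [x if abs(x) >= floor else 0.0 for x in noisy]


def toy_sslr(wave: Sequence[float], frame_len: int = 4) -> Features:
    """Hypothetical SSLR stand-in: one [mean, energy] vector per non-overlapping frame."""
    feats = []
    for start in range(0, len(wave) - frame_len + 1, frame_len):
        frame = wave[start:start + frame_len]
        mean = sum(frame) / frame_len
        energy = sum(x * x for x in frame) / frame_len
        feats.append([mean, energy])
    return feats


def toy_asr(feats: Features, threshold: float = 0.1) -> List[str]:
    """Hypothetical ASR stand-in: label each frame by comparing its energy to `threshold`."""
    return ["speech" if energy > threshold else "sil" for _, energy in feats]


@dataclass
class Cascade:
    """Chains the three stages in the order the abstract describes."""
    enhance: Callable[[Sequence[float]], Waveform]
    extract: Callable[[Sequence[float]], Features]
    recognize: Callable[[Features], List[str]]

    def __call__(self, noisy: Sequence[float]) -> List[str]:
        enhanced = self.enhance(noisy)      # SE: clean up the noisy input
        features = self.extract(enhanced)   # SSLR: features from enhanced speech
        return self.recognize(features)     # ASR: decode from those features


model = Cascade(toy_enhance, toy_sslr, toy_asr)
noisy = [0.01, -0.02, 0.03, 0.0, 0.9, -0.8, 0.7, -0.6]
print(model(noisy))  # ['sil', 'speech']
```

In the actual model the three stages are neural networks trained together, which this sketch does not attempt to reproduce; it only fixes the order of the stages and the interface between them.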
