Paper Title
Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription
Paper Authors
Paper Abstract
Self-supervised-learning-based pre-trained models for speech data, such as Wav2Vec 2.0 (W2V2), have become the backbone of many speech tasks. In this paper, to achieve speaker diarisation and automatic speech recognition (ASR) using a single model, a tandem multitask training (TMT) method is proposed to fine-tune W2V2. For speaker diarisation, the tasks of voice activity detection (VAD) and speaker classification (SC) are required, and connectionist temporal classification (CTC) is used for ASR. The multitask framework implements VAD, SC, and ASR using an early, middle, and late layer of W2V2 respectively, which coincides with the order of the diarisation-and-recognition pipeline: segmenting the audio with VAD, clustering the segments based on speaker embeddings, and transcribing each segment with ASR. Experimental results on the Augmented Multi-party Interaction (AMI) dataset showed that assigning VAD, SC, and ASR to progressively later W2V2 layers for TMT not only saves computational cost but also reduces diarisation error rates (DERs). Joint fine-tuning of VAD, SC, and ASR yielded 16%/17% relative DER reductions with manual/automatic segmentation respectively, and consistent reductions in speaker-attributed word error rate, compared to the baseline with separately fine-tuned models.
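To illustrate the layer-wise head arrangement described in the abstract, below is a minimal sketch in PyTorch, assuming the HuggingFace `Wav2Vec2Model` as the W2V2 backbone. The layer indices, head shapes, and mean pooling for the speaker head are illustrative assumptions, not the authors' exact configuration; the point is only that VAD, SC, and CTC-based ASR outputs can be read from early, middle, and late encoder layers of a single shared model.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class TandemMultitaskW2V2(nn.Module):
    """Sketch of tandem multitask heads on W2V2: VAD from an early layer,
    speaker classification (SC) from a middle layer, and CTC-based ASR
    from a late layer of the same encoder."""

    def __init__(self, vocab_size, num_speakers,
                 vad_layer=4, sc_layer=8, asr_layer=12,
                 w2v2_name="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(w2v2_name)
        hidden = self.encoder.config.hidden_size
        self.vad_layer, self.sc_layer, self.asr_layer = vad_layer, sc_layer, asr_layer
        self.vad_head = nn.Linear(hidden, 2)            # speech / non-speech per frame
        self.sc_head = nn.Linear(hidden, num_speakers)  # speaker logits from pooled frames
        self.asr_head = nn.Linear(hidden, vocab_size)   # per-frame CTC output distribution

    def forward(self, waveform):
        # hidden_states[i] is the output after transformer block i
        # (index 0 is the feature-projection output).
        out = self.encoder(waveform, output_hidden_states=True)
        hs = out.hidden_states
        vad_logits = self.vad_head(hs[self.vad_layer])            # (B, T, 2)
        sc_logits = self.sc_head(hs[self.sc_layer].mean(dim=1))   # (B, num_speakers)
        asr_logits = self.asr_head(hs[self.asr_layer])            # (B, T, vocab)
        return vad_logits, sc_logits, asr_logits
```

In such a setup, joint fine-tuning would combine a frame-level cross-entropy loss for VAD, a cross-entropy loss for SC, and `nn.CTCLoss` on the ASR logits; reading the VAD and SC outputs from earlier layers means those tasks do not require a full forward pass through the encoder, which is the computational saving the abstract refers to.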