Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers

Paper Title

Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers

Authors

Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka

Abstract

We propose an end-to-end speaker-attributed automatic speech recognition model that unifies speaker counting, speech recognition, and speaker identification on monaural overlapped speech. Our model is built on serialized output training (SOT) with attention-based encoder-decoder, a recently proposed method for recognizing overlapped speech comprising an arbitrary number of speakers. We extend SOT by introducing a speaker inventory as an auxiliary input to produce speaker labels as well as multi-speaker transcriptions. All model parameters are optimized by speaker-attributed maximum mutual information criterion, which represents a joint probability for overlapped speech recognition and speaker identification. Experiments on LibriSpeech corpus show that our proposed method achieves significantly better speaker-attributed word error rate than the baseline that separately performs overlapped speech recognition and speaker identification.
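To make the idea of a serialized output concrete, the following is a minimal sketch of how a multi-speaker reference could be flattened into one token sequence with a parallel list of speaker labels. It is not the authors' implementation. The `Utterance` type, the `serialize` and `count_segments` helpers, the `<sc>` and `<eos>` token spellings, and the ordering by start time are all assumptions made for illustration; the abstract itself does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple

SC = "<sc>"    # assumed spelling of a speaker-change token
EOS = "<eos>"  # assumed spelling of an end-of-sequence token


@dataclass
class Utterance:
    """Hypothetical container for one speaker's part of an overlapped mixture."""
    speaker: str
    start: float          # start time in seconds
    words: List[str]


def serialize(utterances: List[Utterance]) -> Tuple[List[str], List[str]]:
    """Flatten overlapping utterances into one token sequence.

    Assumption: utterances are ordered by start time and joined with a
    speaker-change token, so a single decoder can emit all transcriptions.
    Returns the token sequence and the speaker label of each segment.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    tokens: List[str] = []
    speakers: List[str] = []
    for i, utt in enumerate(ordered):
        if i > 0:
            tokens.append(SC)
        tokens.extend(utt.words)
        speakers.append(utt.speaker)
    tokens.append(EOS)
    return tokens, speakers


def count_segments(tokens: List[str]) -> int:
    """Count segments in a serialized sequence (speaker-change tokens + 1).

    This equals the number of speakers only if each speaker contributes
    exactly one utterance to the mixture, which is assumed here.
    """
    return tokens.count(SC) + 1


mixture = [
    Utterance("spk2", 1.5, ["good", "morning"]),
    Utterance("spk1", 0.0, ["hello", "there"]),
]
tokens, speakers = serialize(mixture)
print(tokens)
print(speakers)
print(count_segments(tokens))
```

In this sketch the number of speakers falls out of the decoded sequence itself rather than being fixed in advance, which is one way to read the abstract's claim that counting, recognition, and identification are handled by a single model.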
