Paper Title

Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization

Paper Authors

Wang, Dongmei, Xiao, Xiong, Kanda, Naoyuki, Yoshioka, Takuya, Wu, Jian

Paper Abstract

This paper describes a speaker diarization model based on target speaker voice activity detection (TS-VAD) using transformers. To overcome the original TS-VAD model's drawback of being unable to handle an arbitrary number of speakers, we investigate model architectures that use input tensors with variable-length time and speaker dimensions. Transformer layers are applied to the speaker axis to make the model output insensitive to the order of the speaker profiles provided to the TS-VAD model. Time-wise sequential layers are interspersed between these speaker-wise transformer layers to allow the temporal and cross-speaker correlations of the input speech signal to be captured. We also extend a diarization model based on end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA) by replacing its dot-product-based speaker detection layer with the transformer-based TS-VAD. Experimental results on VoxConverse show that using the transformers for the cross-speaker modeling reduces the diarization error rate (DER) of TS-VAD by 11.3%, achieving a new state-of-the-art (SOTA) DER of 4.57%. Also, our extended EEND-EDA reduces DER by 6.9% on the CALLHOME dataset relative to the original EEND-EDA with a similar model size, achieving a new SOTA DER of 11.18% under a widely used training data setting.
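The abstract's key architectural idea — applying transformer layers along the speaker axis so the output is insensitive to the order of the speaker profiles — follows from the permutation equivariance of self-attention without positional encodings: permuting the speakers in the input permutes the outputs identically, so no speaker ordering is privileged. The following is a minimal NumPy sketch of this property, not the paper's implementation; the toy dimensions, single-head attention, and weight matrices are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Plain single-head scaled dot-product self-attention over the rows of x (n, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return a @ v

rng = np.random.default_rng(0)
T, S, D = 5, 3, 8  # toy sizes: frames, speakers, feature dim (hypothetical)
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
x = rng.standard_normal((T, S, D))  # variable-length time and speaker dims

def speaker_axis_attention(x):
    # Attention over the speaker axis, applied independently at each frame;
    # no positional encoding, so speakers form an unordered set.
    return np.stack([self_attention(x[t], Wq, Wk, Wv) for t in range(x.shape[0])])

y = speaker_axis_attention(x)
perm = np.array([2, 0, 1])                     # reorder the speaker profiles
y_perm = speaker_axis_attention(x[:, perm, :])
# Permutation equivariance: reordering speakers just reorders the outputs,
# so per-speaker VAD decisions do not depend on profile order.
assert np.allclose(y[:, perm, :], y_perm)
```

In the full model described above, such speaker-wise layers would alternate with time-wise sequential layers over the (time, speaker) tensor, capturing temporal and cross-speaker correlations jointly.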
