Paper Title


Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors

Authors

Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yuki Takashima, Yohei Kawaguchi

Abstract


A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multi-label classification problem. It has also been extended for a flexible number of speakers by introducing speaker-wise attractors. However, the output number of speakers of attractor-based EEND is empirically capped; it cannot deal with cases where the number of speakers appearing during inference is higher than that during training because its speaker counting is trained in a fully supervised manner. Our method, EEND-GLA, solves this problem by introducing unsupervised clustering into attractor-based EEND. In the method, the input audio is first divided into short blocks, then attractor-based diarization is performed for each block, and finally, the results of each block are clustered on the basis of the similarity between locally-calculated attractors. While the number of output speakers is limited within each block, the total number of speakers estimated for the entire input can be higher than the limitation. To use EEND-GLA in an online manner, our method also extends the speaker-tracing buffer, which was originally proposed to enable online inference of conventional EEND. We introduce a block-wise buffer update to make the speaker-tracing buffer compatible with EEND-GLA. Finally, to improve online diarization, our method improves the buffer update method and revisits the variable chunk-size training of EEND. The experimental results demonstrate that EEND-GLA can perform speaker diarization of an unseen number of speakers in both offline and online inferences.
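The core idea of EEND-GLA described above — run attractor-based diarization per block, then merge the per-block results by clustering the locally computed attractor vectors so that the global speaker count can exceed the per-block limit — can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual algorithm: it treats attractors as plain embedding vectors, uses greedy online clustering by cosine similarity with a hypothetical `threshold` parameter, and ignores the diarization network itself.

```python
import numpy as np

def cluster_local_attractors(block_attractors, threshold=0.5):
    """Greedily assign per-block attractor vectors to global speaker
    identities by cosine similarity to running centroids.
    Illustrative sketch only, not the paper's clustering method.

    block_attractors: list of blocks; each block is a list of 1-D
    numpy arrays (one attractor per locally detected speaker).
    Returns (per-block global speaker labels, total speaker count).
    """
    centroids = []   # one running centroid per global speaker
    labels = []      # per block: list of global speaker ids
    for attractors in block_attractors:
        block_labels = []
        for a in attractors:
            a = a / np.linalg.norm(a)
            if centroids:
                # Cosine similarity to each existing global speaker.
                sims = [float(a @ (c / np.linalg.norm(c))) for c in centroids]
                best = int(np.argmax(sims))
            if centroids and sims[best] >= threshold:
                centroids[best] = centroids[best] + a   # refine centroid
                block_labels.append(best)
            else:
                centroids.append(a.copy())              # new global speaker
                block_labels.append(len(centroids) - 1)
        labels.append(block_labels)
    return labels, len(centroids)
```

Even though each block here contains at most two attractors, the total estimated speaker count across blocks can be three or more, which mirrors the paper's claim that the global count is not capped by the per-block limit.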
