Paper Title

End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification

Authors

Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu

Abstract

The most common approach to speaker diarization is clustering of speaker embeddings. However, the clustering-based approach has a number of problems: (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting its speaker embedding models to real audio recordings with speaker overlaps. To solve these problems, we propose End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, we formulate the speaker diarization problem as a multi-label classification problem and introduce a permutation-free objective function to directly minimize diarization errors. Besides its end-to-end simplicity, the EEND method can explicitly handle speaker overlaps during training and inference. Just by feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations. We evaluated our method on simulated speech mixtures and real conversation datasets. The results showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method, while correctly handling speaker overlaps. We explored the neural network architecture for the EEND method and found that the self-attention-based neural network was the key to achieving excellent performance. In contrast to conditioning the network only on its previous and next hidden states, as is done with bidirectional long short-term memory (BLSTM), self-attention is directly conditioned on all the frames. By visualizing the attention weights, we show that self-attention captures global speaker characteristics in addition to local speech activity dynamics, making it especially suitable for dealing with the speaker diarization problem.
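The permutation-free objective mentioned in the abstract can be illustrated with a minimal sketch: since the order of speaker output channels is arbitrary, the loss is taken as the minimum binary cross-entropy over all speaker permutations. The function name and the NumPy-based formulation below are illustrative assumptions, not the paper's actual implementation (which operates on neural network outputs during training).

```python
import itertools
import numpy as np

def permutation_free_bce(y_pred, y_true, eps=1e-7):
    """Permutation-free binary cross-entropy for diarization (sketch).

    y_pred: (T, S) array of frame-wise speech activity probabilities,
            one column per output speaker channel.
    y_true: (T, S) binary reference labels, one column per speaker.
    Returns the minimum mean BCE over all S! speaker permutations,
    so the loss does not depend on how output channels are ordered.
    """
    num_speakers = y_true.shape[1]
    losses = []
    for perm in itertools.permutations(range(num_speakers)):
        # Reorder predicted channels according to this permutation.
        p = np.clip(y_pred[:, perm], eps, 1 - eps)
        bce = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
        losses.append(bce.mean())
    return min(losses)
```

Because the minimum is taken over all permutations, swapping the predicted speaker channels leaves the loss unchanged, which is exactly the property needed when reference speakers have no canonical order. (Enumerating all S! permutations is only practical for a small number of speakers, as in the two-speaker setting evaluated in the paper.)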
