Paper Title


X-DC: Explainable Deep Clustering based on Learnable Spectrogram Templates

Authors

Chihiro Watanabe, Hirokazu Kameoka

Abstract


Deep neural networks (DNNs) have achieved substantial predictive performance in various speech processing tasks. In particular, it has been shown that the monaural speech separation task can be successfully solved with a DNN-based method called deep clustering (DC), which uses a DNN to describe the process of assigning a continuous vector to each time-frequency (TF) bin and to measure how likely each pair of TF bins is to be dominated by the same speaker. In DC, the DNN is trained so that the embedding vectors for TF bins dominated by the same speaker are forced close to each other. One concern regarding DC is that the embedding process described by the DNN has a black-box structure, which is usually very hard to interpret. A potential weakness stemming from this non-interpretable black-box structure is that the model lacks the flexibility to address mismatches between training and test conditions (caused by reverberation, for instance). To overcome this limitation, in this paper we propose the concept of explainable deep clustering (X-DC), whose network architecture can be interpreted as a process of fitting learnable spectrogram templates to an input spectrogram followed by Wiener filtering. During training, the elements of the spectrogram templates and their activations are constrained to be non-negative, which promotes sparsity in their values and thus improves interpretability. The main advantage of this framework is that its physically interpretable structure naturally allows us to incorporate a model adaptation mechanism into the network. We experimentally show that the proposed X-DC enables us to visualize and understand the clues the model uses to determine the embedding vectors, while achieving speech separation performance comparable to that of the original DC models.
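As a concrete illustration of the DC objective the abstract refers to, the sketch below computes the standard affinity-based loss that pulls embeddings of same-speaker TF bins together and pushes different-speaker bins apart. This is a minimal NumPy sketch under our own naming assumptions (`dc_loss`, `V`, `Y`, `n_bins`, `embed_dim` are illustrative), not code from the paper.

```python
# Minimal sketch of the deep clustering (DC) affinity loss; names and
# shapes are illustrative assumptions, not taken from the paper.
import numpy as np

def dc_loss(V, Y):
    """DC objective ||V V^T - Y Y^T||_F^2.

    V : (n_bins, embed_dim) embedding vectors, one row per TF bin.
    Y : (n_bins, n_speakers) one-hot indicators of the dominant speaker.
    """
    # Expanding the Frobenius norm avoids forming the huge
    # n_bins x n_bins affinity matrices explicitly.
    return (np.linalg.norm(V.T @ V, "fro") ** 2
            - 2 * np.linalg.norm(V.T @ Y, "fro") ** 2
            + np.linalg.norm(Y.T @ Y, "fro") ** 2)
```

Minimizing this loss drives the embedding affinity matrix V V^T toward the ideal binary affinity Y Y^T, which is exactly the "embeddings of same-speaker TF bins get close" criterion described above.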
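Likewise, the template-fitting-plus-Wiener-filtering interpretation of X-DC can be sketched as follows: non-negative spectrogram templates with non-negative activations approximate the mixture, and each speaker's soft (Wiener) mask is that speaker's share of the fitted power. All names here (`wiener_masks`, `W`, `H`, `basis_to_speaker`) are hypothetical and stand in for the paper's actual architecture.

```python
# Minimal sketch of soft-mask (Wiener) construction from fitted
# non-negative templates; an assumed illustration, not the paper's code.
import numpy as np

def wiener_masks(W, H, basis_to_speaker, eps=1e-8):
    """W : (F, K) non-negative spectrogram templates.
    H : (K, T) non-negative activations.
    basis_to_speaker : length-K integer array assigning each template
        to a speaker.
    Returns soft masks of shape (n_speakers, F, T) that sum to 1
    at every TF bin.
    """
    n_speakers = int(basis_to_speaker.max()) + 1
    # Fitted spectrogram contributed by each speaker's templates.
    per_speaker = np.stack([
        W[:, basis_to_speaker == s] @ H[basis_to_speaker == s]
        for s in range(n_speakers)
    ])
    total = per_speaker.sum(axis=0) + eps  # eps guards against division by 0
    return per_speaker / total
```

Multiplying the mixture spectrogram element-wise by each mask then yields the separated speaker spectrograms, which is the Wiener-filtering step the abstract describes; the non-negativity of `W` and `H` is what makes the fitted templates directly inspectable.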
