Heterogeneous Separation Consistency Training for Adaptation of Unsupervised Speech Separation

Paper Title

Heterogeneous Separation Consistency Training for Adaptation of Unsupervised Speech Separation

Authors

Jiangyu Han, Yanhua Long

Abstract

Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing separation methods require ground-truth sources and are trained on synthetic datasets. This reliance on ground truth is problematic, because ground-truth signals are usually unavailable in real conditions. Moreover, in many industry scenarios, the real acoustic characteristics deviate far from those in simulated datasets. Therefore, performance usually degrades significantly when supervised speech separation models are applied to real applications. To address these problems, in this study, we propose a novel separation consistency training, termed SCT, to exploit real-world unlabeled mixtures for improving cross-domain unsupervised speech separation in an iterative manner, by leveraging the complementary information obtained from heterogeneous (structurally distinct but behaviorally complementary) models. SCT follows a framework using two heterogeneous neural networks (HNNs) to produce high-confidence pseudo labels for unlabeled real speech mixtures. These labels are then updated and used to refine the HNNs, which in turn produce more reliable and consistent separation results for pseudo-labeling real mixtures. To maximally utilize the rich complementary information between different separation networks, a cross-knowledge adaptation is further proposed. Together with the simulated dataset, the real mixtures with high-confidence pseudo labels are then used to update the HNN separation models iteratively. In addition, we find that combining the heterogeneous separation outputs by a simple linear fusion can further slightly improve the final system performance.
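The abstract does not say how consistency between the two networks is measured or how pseudo labels are selected, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that consistency is scored by the mean SI-SNR between the two models' outputs under the best speaker permutation, that a mixture is kept when this score exceeds a threshold, and that the kept pseudo label is a linear fusion of the aligned outputs. The function names, the 5 dB threshold, and the fusion weight `alpha` are all hypothetical.

```python
import math
from itertools import permutations


def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB of `est` against `ref` (equal-length float lists)."""
    n = len(ref)
    mean_e, mean_r = sum(est) / n, sum(ref) / n
    e = [x - mean_e for x in est]
    r = [x - mean_r for x in ref]
    # Project the estimate onto the reference to get the scaled target.
    scale = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [scale * b for b in r]
    noise = [a - t for a, t in zip(e, target)]
    t_energy = sum(t * t for t in target)
    n_energy = sum(x * x for x in noise)
    return 10 * math.log10((t_energy + eps) / (n_energy + eps))


def align_and_score(out_a, out_b):
    """Find the permutation of out_b that best matches out_a.

    Returns (permutation, mean SI-SNR over the matched pairs).
    """
    best_perm, best_score = None, -math.inf
    for perm in permutations(range(len(out_b))):
        score = sum(si_snr(out_a[i], out_b[j]) for i, j in enumerate(perm))
        score /= len(out_a)
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


def fuse(out_a, out_b, perm, alpha=0.5):
    """Linear fusion of the aligned outputs: alpha * a + (1 - alpha) * b."""
    return [
        [alpha * x + (1 - alpha) * y for x, y in zip(out_a[i], out_b[j])]
        for i, j in enumerate(perm)
    ]


def pseudo_label(out_a, out_b, threshold_db=5.0, alpha=0.5):
    """Return fused pseudo labels if the two models agree, else None.

    `threshold_db` is an assumed confidence threshold, not a value from the paper.
    """
    perm, score = align_and_score(out_a, out_b)
    if score < threshold_db:
        return None  # The models disagree, so the mixture stays unlabeled this round.
    return fuse(out_a, out_b, perm, alpha)
```

In an iterative scheme of this kind, mixtures for which `pseudo_label` returns a result would be added to the training pool together with the simulated data, both networks would be updated, and the selection would be repeated with the refined models.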
