Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

Paper Title

Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

Authors

Zifeng Zhao, Rongzhi Gu, Dongchao Yang, Jinchuan Tian, Yuexian Zou

Abstract

Most prior work adopts supervised training for speaker extraction, while the scarcity of ideally clean corpora and the channel mismatch problem are rarely considered. To this end, we propose speaker-aware mixture of mixtures training (SAMoM), which utilizes the consistency of speaker identity among the target source, the enrollment utterance, and the target estimate to weakly supervise the training of a deep speaker extractor. In SAMoM, the input is constructed by mixing different speaker-aware mixtures (SAMs), each of which contains multiple speakers whose identities are known and whose enrollment utterances are available. Informed by the enrollment utterances, target speech is extracted from the input one speaker at a time, such that the estimated targets, once remixed according to identity consistency, approximate the original SAMs. Moreover, using SAMoM in a semi-supervised setting with a certain amount of clean sources enables application in noisy scenarios. Extensive experiments on Libri2Mix show that the proposed method achieves promising results without access to any clean sources (11.06 dB SI-SDRi). With domain adaptation, our approach even outperforms the supervised framework in a cross-domain evaluation on AISHELL-1.
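The remix step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not state the training loss, so negative SI-SDR between each remixed estimate and its SAM is an assumption here, and all function names (`si_sdr`, `remix_by_identity`, `samom_loss`) are hypothetical. The toy sources exist only to synthesize SAMs and fake extractor outputs; in the weakly supervised setting only the SAMs, the speaker identities, and the enrollment utterances would be available.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (1-D waveforms)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to remove scale dependence.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    )


def remix_by_identity(estimates, sam_speakers):
    """Sum per-speaker estimates into one signal per SAM.

    estimates: dict mapping speaker id -> estimated waveform
    sam_speakers: list of speaker-id lists, one list per SAM
    """
    return [sum(estimates[s] for s in speakers) for speakers in sam_speakers]


def samom_loss(estimates, sams, sam_speakers):
    """Assumed loss: mean negative SI-SDR between remixed estimates and SAMs."""
    remixed = remix_by_identity(estimates, sam_speakers)
    return -float(np.mean([si_sdr(r, m) for r, m in zip(remixed, sams)]))


# Toy data: random signals stand in for speech (illustration only).
rng = np.random.default_rng(0)
speakers = ["spk_a", "spk_b", "spk_c", "spk_d"]
sources = {s: rng.standard_normal(16000) for s in speakers}

# Two SAMs, each with known speaker identities.
sam_speakers = [["spk_a", "spk_b"], ["spk_c", "spk_d"]]
sams = [sum(sources[s] for s in spks) for spks in sam_speakers]

# The extractor input is the mixture of the SAMs.
mom_input = sams[0] + sams[1]

# Stand-in for extractor outputs, one per enrolled speaker: the true
# source plus a small error. A real extractor would take mom_input and
# an enrollment utterance instead.
estimates = {s: sources[s] + 0.1 * rng.standard_normal(16000) for s in speakers}

loss = samom_loss(estimates, sams, sam_speakers)
print(f"SAMoM-style loss (negative SI-SDR, dB): {loss:.2f}")
```

Because the loss compares only remixed estimates with the observed SAMs, it never touches individual clean sources. Assigning an estimate to the wrong SAM makes the remix diverge from its reference, which is how speaker identity acts as the weak supervision signal in this sketch.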
