Paper Title


Boosting Self-Supervised Embeddings for Speech Enhancement

Authors

Kuo-Hsuan Hung, Szu-wei Fu, Huan-Hsin Tseng, Hsin-Tien Chiang, Yu Tsao, Chii-Wann Lin

Abstract


Self-supervised learning (SSL) representation for speech has achieved state-of-the-art (SOTA) performance on several downstream tasks. However, there remains room for improvement in speech enhancement (SE) tasks. In this study, we used a cross-domain feature to solve the problem that SSL embeddings may lack fine-grained information to regenerate speech signals. By integrating the SSL representation and spectrogram, the result can be significantly boosted. We further study the relationship between the noise robustness of SSL representation via clean-noisy distance (CN distance) and the layer importance for SE. Consequently, we found that SSL representations with lower noise robustness are more important. Furthermore, our experiments on the VCTK-DEMAND dataset demonstrated that fine-tuning an SSL representation with an SE model can outperform the SOTA SSL-based SE methods in PESQ, CSIG and COVL without invoking complicated network architectures. In later experiments, the CN distance in SSL embeddings was observed to increase after fine-tuning. These results verify our expectations and may help design SE-related SSL training in the future.
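The abstract's clean-noisy (CN) distance measures, per SSL layer, how far the embeddings of a noisy utterance drift from those of its clean counterpart. The paper does not spell out the exact formula here, so the following is a minimal sketch under one plausible reading: the mean frame-wise Euclidean distance between clean and noisy embeddings at each layer, using random arrays as stand-ins for real SSL features.

```python
import numpy as np


def cn_distance(clean_embs, noisy_embs):
    """Per-layer clean-noisy (CN) distance, sketched as the mean
    frame-wise Euclidean distance between clean and noisy embeddings.

    clean_embs, noisy_embs: lists with one (frames, dim) array per layer,
    e.g. the hidden states of each transformer layer of an SSL model.
    """
    distances = []
    for clean, noisy in zip(clean_embs, noisy_embs):
        # L2 distance per frame, averaged over all frames of this layer
        frame_dists = np.linalg.norm(clean - noisy, axis=-1)
        distances.append(float(frame_dists.mean()))
    return distances


# Toy stand-ins for a 3-layer SSL model (50 frames, 768-dim features);
# the noisy embeddings are the clean ones plus small perturbations.
rng = np.random.default_rng(0)
clean = [rng.standard_normal((50, 768)) for _ in range(3)]
noisy = [c + 0.1 * rng.standard_normal(c.shape) for c in clean]

print(cn_distance(clean, noisy))  # one distance per layer
```

Under this reading, a layer whose CN distance is small is more noise-robust (noise barely moves its embeddings); the abstract's finding is that the less robust layers, i.e. those with larger CN distance, carry more weight for SE.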
