Paper Title

Exploring the Best Loss Function for DNN-Based Low-latency Speech Enhancement with Temporal Convolutional Networks

Authors

Koyama, Yuichiro, Vuong, Tyler, Uhlich, Stefan, Raj, Bhiksha

Abstract

Recently, deep neural networks (DNNs) have been successfully used for speech enhancement, and DNN-based speech enhancement is becoming an attractive research area. While time-frequency masking based on the short-time Fourier transform (STFT) has been widely used for DNN-based speech enhancement over the last few years, time-domain methods such as the time-domain audio separation network (TasNet) have also been proposed. The most suitable method depends on the scale of the dataset and the type of task. In this paper, we explore the best speech enhancement algorithm on two different datasets. We propose an STFT-based method and a loss function using problem-agnostic speech encoder (PASE) features to improve subjective quality on the smaller dataset. Our proposed methods are effective on the Voice Bank + DEMAND dataset and compare favorably to other state-of-the-art methods. We also implement a low-latency version of TasNet, which we submitted to the DNS Challenge and made public by open-sourcing it. Our model achieves excellent performance on the DNS Challenge dataset.
