SkiM: Skipping Memory LSTM for Low-Latency Real-Time Continuous Speech Separation

Paper Title

SkiM: Skipping Memory LSTM for Low-Latency Real-Time Continuous Speech Separation

Authors

Chenda Li, Lei Yang, Weiqin Wang, Yanmin Qian

Abstract

Continuous speech separation (CSS) for meeting pre-processing has recently become an active research topic. Compared to the data in utterance-level speech separation, a meeting-style audio stream lasts longer and has an uncertain number of speakers. We adopt the time-domain speech separation method and the recently proposed Graph-PIT to build a super low-latency online speech separation model, which is essential for real-world applications. A low-latency time-domain encoder with a small stride leads to an extremely long feature sequence. We propose a simple yet efficient model named Skipping Memory (SkiM) for long sequence modeling. Experimental results show that SkiM achieves on-par or even better separation performance than DPRNN. Meanwhile, the computational cost of SkiM is reduced by 75% compared to DPRNN. The strong long-sequence modeling capability and low computational cost make SkiM a suitable model for online CSS applications. Our fastest real-time model achieves a 17.1 dB signal-to-distortion ratio (SDR) improvement with less than 1 millisecond of latency in the simulated meeting-style evaluation.
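
The abstract's point that a small encoder stride produces an extremely long feature sequence can be illustrated with a little arithmetic. The sketch below is only an illustration: the 8 kHz sample rate, the 60-second stream, and the kernel/stride pairs are assumed values chosen for the example, not settings taken from the paper, and `num_frames` is a hypothetical helper using the standard output-length formula for an unpadded 1-D convolution.

```python
def num_frames(num_samples: int, kernel_size: int, stride: int) -> int:
    """Number of frames produced by an unpadded 1-D convolutional encoder."""
    if num_samples < kernel_size:
        return 0
    return (num_samples - kernel_size) // stride + 1


# Assumed values for illustration only (not from the paper).
SAMPLE_RATE = 8000   # samples per second
DURATION_S = 60      # length of the audio stream in seconds
samples = SAMPLE_RATE * DURATION_S

# Hypothetical (kernel_size, stride) pairs; smaller windows mean lower
# algorithmic latency but many more frames for the separator to model.
for kernel_size, stride in [(16, 8), (8, 4), (4, 2)]:
    frames = num_frames(samples, kernel_size, stride)
    window_ms = 1000 * kernel_size / SAMPLE_RATE
    print(f"kernel={kernel_size:2d} stride={stride} "
          f"window={window_ms:.1f} ms frames={frames}")
```

Under these assumptions, halving the stride roughly doubles the number of frames, which is why a sub-millisecond encoder window calls for a sequence model that stays cheap on very long inputs.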
