An End-to-end Architecture of Online Multi-channel Speech Separation

Paper Title

An End-to-end Architecture of Online Multi-channel Speech Separation

Authors

Wu, Jian; Chen, Zhuo; Li, Jinyu; Yoshioka, Takuya; Tan, Zhili; Lin, Ed; Luo, Yi; Xie, Lei

Abstract

Multi-speaker speech recognition has been one of the key challenges in conversation transcription, as it breaks the single active speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered a remedy to this problem. Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is processed by fixed beamformers, followed by neural network post-filtering. Although promising results were obtained, the system contains multiple individually developed modules, leading to potentially sub-optimum performance. In this work, we introduce an end-to-end modeling version of UFE. To enable gradient propagation all the way, an attentional selection module is proposed, where an attentional weight is learnt for each beamformer and spatial feature sampled over space. Experimental results show that the proposed system achieves comparable performance in an offline evaluation with the original separate processing-based pipeline, while producing remarkable improvements in an online evaluation.
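
The abstract states that an attentional weight is learnt for each fixed beamformer so that gradients can propagate through the selection step. The NumPy sketch below illustrates only that general idea, a softmax-weighted sum over beam outputs in place of a hard pick. It is not the authors' implementation: the function names, the per-beam `scores` input (which in a real system would come from a trained network), and the tensor shapes are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=0):
    # Subtract the maximum before exponentiating for numerical stability.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def attentional_select(beam_feats, scores):
    """Soft-select among fixed beamformer outputs (illustrative only).

    beam_feats: array of shape (num_beams, frames, bins), one slice per beam.
    scores:     array of shape (num_beams,), hypothetical relevance scores.
    """
    weights = softmax(scores, axis=0)  # non-negative, sums to 1 over beams
    # Weighted sum over the beam axis. Unlike a hard argmax pick, this is
    # a smooth function of the scores, so gradients can flow through it.
    selected = np.tensordot(weights, beam_feats, axes=(0, 0))
    return selected, weights


# Arbitrary sizes chosen for the example, not taken from the paper.
rng = np.random.default_rng(0)
beam_feats = rng.standard_normal((4, 50, 257))
scores = rng.standard_normal(4)
selected, weights = attentional_select(beam_feats, scores)
```

When one score dominates, the weights approach a one-hot vector and the output approaches that single beam's features, so the soft selection behaves like a hard pick in the limit while remaining differentiable.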
