Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

Paper Title

Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

Authors

Gao, Zhifu; Zhang, Shiliang; Lei, Ming; McLoughlin, Ian

Abstract

Recently, online end-to-end ASR has gained increasing attention. However, the performance of online systems still lags far behind that of offline systems, with a large gap in recognition quality. For specific scenarios, we can trade off performance against latency, and can train multiple systems with different delays to match the performance and latency requirements of various application scenarios. In this work, rather than trading off performance against latency, we envisage a single system that can match the needs of different scenarios. We propose a novel architecture, termed Universal ASR, that can unify streaming and non-streaming ASR models into one system. The embedded streaming ASR model can be configured with different delays according to requirements to obtain real-time recognition results, while the non-streaming model is able to refresh the final recognition result for better performance. We have evaluated our approach on the public AISHELL-2 benchmark and an industrial-level 20,000-hour Mandarin speech recognition task. The experimental results show that Universal ASR provides an efficient mechanism to integrate streaming and non-streaming models that can recognize speech quickly and accurately. On the AISHELL-2 task, Universal ASR comfortably outperforms other state-of-the-art systems.
