Paper Title
Large-Scale Streaming End-to-End Speech Translation with Neural Transducers
Paper Authors
Paper Abstract
Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce them to streaming end-to-end speech translation (ST), which aims to convert audio signals directly into text in other languages. Compared with cascaded ST, which performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency, exploits speech information, and avoids error propagation from ASR to MT. To improve modeling capacity, we propose attention pooling for the joint network in TT. In addition, we extend TT-based ST to multilingual ST, which generates text in multiple languages at the same time. Experimental results on a large-scale 50 thousand (K)-hour pseudo-labeled training set show that TT-based ST not only significantly reduces inference time but also outperforms non-streaming cascaded ST for English-German translation.
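To make the "attention pooling for the joint network" idea concrete, below is a minimal PyTorch sketch of a transducer joint network that weights the encoder and prediction-network states with learned attention scores before merging them. The abstract does not specify the exact formulation, so this particular weighting scheme, the class name, and all dimensions are illustrative assumptions, not the authors' design.

```python
# Minimal sketch of a transducer joint network with attention pooling.
# The exact formulation is not given in the abstract; weighting the two
# state sources with learned attention scores is an assumption made here
# purely for illustration.
import torch
import torch.nn as nn


class AttentionPoolingJoint(nn.Module):
    """Joint network for a neural transducer (hypothetical attention pooling).

    Instead of the common additive combination enc + pred, the encoder and
    prediction-network states are scored and softmax-weighted per (t, u)
    position before being pooled and projected to the output vocabulary.
    """

    def __init__(self, enc_dim: int, pred_dim: int, joint_dim: int, vocab_size: int):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        # One scalar score per source (encoder state vs. prediction state).
        self.score = nn.Linear(joint_dim, 1)
        self.output = nn.Linear(joint_dim, vocab_size)

    def forward(self, enc_out: torch.Tensor, pred_out: torch.Tensor) -> torch.Tensor:
        # enc_out:  (B, T, enc_dim)   acoustic encoder states
        # pred_out: (B, U, pred_dim)  prediction (label) network states
        enc = self.enc_proj(enc_out).unsqueeze(2)     # (B, T, 1, D)
        pred = self.pred_proj(pred_out).unsqueeze(1)  # (B, 1, U, D)
        # Stack the two sources and attend over them per (t, u) position.
        stacked = torch.stack(
            (enc.expand(-1, -1, pred.size(2), -1),
             pred.expand(-1, enc.size(1), -1, -1)),
            dim=-2,                                   # (B, T, U, 2, D)
        )
        weights = torch.softmax(self.score(torch.tanh(stacked)), dim=-2)
        pooled = (weights * stacked).sum(dim=-2)      # (B, T, U, D)
        return self.output(torch.tanh(pooled))        # (B, T, U, vocab)


if __name__ == "__main__":
    joint = AttentionPoolingJoint(enc_dim=512, pred_dim=512,
                                  joint_dim=640, vocab_size=4000)
    logits = joint(torch.randn(2, 50, 512), torch.randn(2, 10, 512))
    print(logits.shape)  # torch.Size([2, 50, 10, 4000])
```

The output tensor of shape (batch, audio frames, label positions, vocabulary) is what an RNN-T/TT loss would consume; the attention pooling only changes how the two state streams are combined, so it slots into a standard transducer training pipeline unchanged.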