Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder

Paper Title

Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder

Authors

Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

Abstract

Fast inference speed is an important goal towards real-world deployment of speech translation (ST) systems. End-to-end (E2E) models based on the encoder-decoder architecture are more suitable for this goal than traditional cascaded systems, but their effectiveness regarding decoding speed has not been explored so far. Inspired by recent progress in non-autoregressive (NAR) methods in text-based translation, which generate target tokens in parallel by eliminating conditional dependencies, we study the problem of NAR decoding for E2E-ST. We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder. The latter is used for selecting a better translation among candidates of various lengths generated by the former, which dramatically improves the effectiveness of a large length beam with negligible overhead. We further investigate effective length prediction methods from speech inputs and the impact of vocabulary sizes. Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality compared to state-of-the-art AR E2E-ST systems.
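
The candidate selection step described in the abstract can be illustrated with a minimal sketch. It assumes the NAR decoder has already produced one hypothesis per length in the length beam, and that the AR decoder can score a complete hypothesis in a single teacher-forced pass. The names `select_by_ar_score` and `toy_ar_token_log_probs`, the toy scorer, and the length normalization are all assumptions made for illustration; the abstract does not specify the exact scoring rule.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[str]


def select_by_ar_score(
    candidates: Sequence[Tokens],
    ar_token_log_probs: Callable[[Tokens], List[float]],
) -> Tuple[Tokens, float]:
    """Pick the candidate with the highest average AR log-probability.

    `ar_token_log_probs` stands in for an AR decoder run with teacher
    forcing: it returns one log-probability per token plus one for the
    end-of-sequence symbol. Averaging over positions is an assumed
    normalization, not something stated in the abstract.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    best, best_score = list(candidates[0]), float("-inf")
    for cand in candidates:
        log_probs = ar_token_log_probs(cand)
        score = sum(log_probs) / len(log_probs)
        if score > best_score:
            best, best_score = list(cand), score
    return best, best_score


def toy_ar_token_log_probs(tokens: Tokens) -> List[float]:
    """Hypothetical scorer: rewards agreement with a fixed target sentence."""
    target = ["thank", "you", "very", "much"]
    scores = [
        -0.1 if i < len(target) and tok == target[i] else -2.0
        for i, tok in enumerate(tokens)
    ]
    # End-of-sequence term: likely only if the length matches the target.
    scores.append(-0.1 if len(tokens) == len(target) else -2.0)
    return scores


# One NAR hypothesis per candidate length (toy data).
candidates = [
    ["thank", "you"],
    ["thank", "you", "very", "much"],
    ["thank", "you", "very", "very", "much"],
]
best, score = select_by_ar_score(candidates, toy_ar_token_log_probs)
print(best, round(score, 3))
```

Because each candidate is already complete, scoring needs no step-by-step generation, which is consistent with the abstract's claim that a large length beam adds negligible overhead.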
