Paper Title
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Paper Authors
Paper Abstract
Encoder-decoder models provide a generic architecture for sequence-to-sequence tasks such as speech recognition and translation. While offline systems are often evaluated on quality metrics like word error rate (WER) and BLEU, latency is also a crucial factor in many practical use cases. We propose three latency-reduction techniques for chunk-based incremental inference and evaluate their efficiency in terms of the accuracy-latency trade-off. On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 seconds by sacrificing 1% WER (6% rel.) compared to offline transcription. Although our experiments use the Transformer, the hypothesis selection strategies are applicable to other encoder-decoder models. To avoid expensive re-computation, we use a unidirectionally-attending encoder. After an adaptation procedure to partial sequences, the unidirectional model performs on par with the original model. We further show that our approach is also applicable to low-latency speech translation. On How2 English-Portuguese speech translation, we reduce latency to 0.7 seconds (-84% rel.) while incurring a loss of 2.4 BLEU points (5% rel.) compared to the offline system.
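To make the chunk-based incremental inference concrete, the sketch below simulates one plausible partial hypothesis selection rule: after each new audio chunk arrives, the growing input is re-decoded, and only the prefix on which consecutive hypotheses agree is committed to the output. This is a minimal illustration, not the paper's exact method; `toy_decoder`, `incremental_decode`, and the agreement rule are all hypothetical stand-ins (the paper proposes three selection strategies and a unidirectional encoder to avoid re-computation, none of which are reproduced here).

```python
def common_prefix(a, b):
    """Longest common prefix of two token lists."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

def incremental_decode(chunks, decode_fn):
    """Chunk-based incremental inference (toy version).

    After each chunk, re-decode all audio received so far and emit
    only the tokens on which the current and previous hypotheses
    agree -- these are assumed stable and are never retracted.
    """
    emitted = []   # tokens already committed to the output
    prev = []      # hypothesis from the previous chunk
    audio = []
    for chunk in chunks:
        audio.extend(chunk)
        hyp = decode_fn(audio)            # full hypothesis for audio so far
        stable = common_prefix(prev, hyp)
        emitted.extend(stable[len(emitted):])  # emit newly stable tokens early
        prev = hyp
    emitted.extend(prev[len(emitted):])   # flush the remainder at end of stream
    return emitted

# Toy decoder standing in for a real encoder-decoder model:
# hypotheses grow and stabilize as more audio context arrives.
def toy_decoder(audio):
    table = {
        2: ["the", "cat"],
        4: ["the", "cat", "sat"],
        6: ["the", "cat", "sat", "down"],
    }
    return table[len(audio)]

print(incremental_decode([[0, 1], [2, 3], [4, 5]], toy_decoder))
# -> ['the', 'cat', 'sat', 'down']
```

The latency benefit comes from emitting "the cat" after the second chunk rather than waiting for the full utterance; the quality cost arises when an early-committed prefix would have been revised given more context.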