Paper Title
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Paper Authors
Paper Abstract
Encoder-decoder models provide a generic architecture for sequence-to-sequence tasks such as speech recognition and translation. While offline systems are often evaluated on quality metrics like word error rate (WER) and BLEU, latency is also a crucial factor in many practical use cases. We propose three latency-reduction techniques for chunk-based incremental inference and evaluate their efficiency in terms of the accuracy-latency trade-off. On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 seconds by sacrificing 1% WER (6% rel.) compared to offline transcription. Although our experiments use the Transformer, the hypothesis selection strategies are applicable to other encoder-decoder models. To avoid expensive re-computation, we use a unidirectionally-attending encoder. After an adaptation procedure to partial sequences, the unidirectional model performs on par with the original model. We further show that our approach is also applicable to low-latency speech translation. On How2 English-Portuguese speech translation, we reduce latency to 0.7 seconds (-84% rel.) while incurring a loss of 2.4 BLEU points (5% rel.) compared to the offline system.
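To make the chunk-based incremental inference concrete, the sketch below simulates one plausible partial hypothesis selection rule: after each new audio chunk arrives, the growing input is re-decoded, and only the prefix on which consecutive hypotheses agree is committed to the output. This is a minimal illustration, not the paper's exact method; `toy_decoder`, `incremental_decode`, and the agreement rule are all hypothetical stand-ins (the paper proposes three selection strategies and a unidirectional encoder to avoid re-computation, none of which are reproduced here).

```python
def common_prefix(a, b):
    """Longest common prefix of two token lists."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

def incremental_decode(chunks, decode_fn):
    """Chunk-based incremental inference (toy version).

    After each chunk, re-decode all audio received so far and emit
    only the tokens on which the current and previous hypotheses
    agree -- these are assumed stable and are never retracted.
    """
    emitted = []   # tokens already committed to the output
    prev = []      # hypothesis from the previous chunk
    audio = []
    for chunk in chunks:
        audio.extend(chunk)
        hyp = decode_fn(audio)            # full hypothesis for audio so far
        stable = common_prefix(prev, hyp)
        emitted.extend(stable[len(emitted):])  # emit newly stable tokens early
        prev = hyp
    emitted.extend(prev[len(emitted):])   # flush the remainder at end of stream
    return emitted

# Toy decoder standing in for a real encoder-decoder model:
# hypotheses grow and stabilize as more audio context arrives.
def toy_decoder(audio):
    table = {
        2: ["the", "cat"],
        4: ["the", "cat", "sat"],
        6: ["the", "cat", "sat", "down"],
    }
    return table[len(audio)]

print(incremental_decode([[0, 1], [2, 3], [4, 5]], toy_decoder))
# -> ['the', 'cat', 'sat', 'down']
```

The latency benefit comes from emitting "the cat" after the second chunk rather than waiting for the full utterance; the quality cost arises when an early-committed prefix would have been revised given more context.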