Paper Title
Sequence-to-Sequence Learning via Attention Transfer for Incremental Speech Recognition
Paper Authors
Paper Abstract
Attention-based sequence-to-sequence automatic speech recognition (ASR) incurs a significant delay on long utterances because the output is generated only after the entire input sequence has been received. Although several recent studies have proposed mechanisms for incremental speech recognition (ISR), they rely on frameworks and learning algorithms that are more complicated than the standard ASR model. One main reason is that the model needs to decide the incremental steps and learn the transcription that aligns with the current short speech segment. In this work, we investigate whether the original architecture of attention-based ASR can be employed for the ISR task by treating a full-utterance ASR system as the teacher model and the ISR system as the student model. We design an alternative student network that, instead of being thinner or shallower, keeps the original architecture of the teacher model but operates on shorter sequences (fewer encoder and decoder states). Using attention transfer, the student network learns to mimic the teacher's alignment between the current short input speech segment and its transcription. Our experiments show that by delaying the start of the recognition process by about 1.7 seconds, we can achieve performance comparable to a model that must wait until the end of the utterance.
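To make the attention-transfer idea concrete, below is a minimal sketch (not taken from the paper) of a loss that encourages a student ISR model's attention over a short speech segment to match the teacher's alignment over the corresponding frames. The tensor names, the KL-divergence form of the loss, and the weighting term `lambda_at` are illustrative assumptions; the paper's exact objective and segment alignment scheme may differ.

```python
# Illustrative attention-transfer sketch (assumed form, not the paper's code).
# Assumes teacher/student attention weights over the same segment are available.
import torch
import torch.nn.functional as F

def attention_transfer_loss(student_attn: torch.Tensor,
                            teacher_attn: torch.Tensor,
                            eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between student and teacher attention distributions.

    student_attn: (batch, dec_steps, enc_frames) student attention over the segment
    teacher_attn: (batch, dec_steps, enc_frames) teacher attention restricted to the
                  corresponding frames/tokens (an assumption about segment alignment)
    """
    # Normalize so each decoder step is a proper distribution over encoder frames.
    student = student_attn / (student_attn.sum(dim=-1, keepdim=True) + eps)
    teacher = teacher_attn / (teacher_attn.sum(dim=-1, keepdim=True) + eps)
    # KL(teacher || student); kl_div expects log-probabilities as the first argument.
    return F.kl_div((student + eps).log(), teacher, reduction="batchmean")

# Example usage with random tensors standing in for real attention weights:
s_attn = torch.softmax(torch.randn(4, 10, 50), dim=-1)  # 4 segments, 10 tokens, 50 frames
t_attn = torch.softmax(torch.randn(4, 10, 50), dim=-1)
loss_at = attention_transfer_loss(s_attn, t_attn)

# A hypothetical combined objective would add this term to the usual
# cross-entropy on the segment transcription:
#   loss = ce_loss + lambda_at * loss_at
```

The design intuition, as described in the abstract, is that the student keeps the teacher's architecture but sees shorter sequences, so the transferred attention supplies the alignment information it can no longer infer from the full utterance.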