Paper Title
Weak-Attention Suppression For Transformer Based Speech Recognition
Paper Authors
Paper Abstract
Transformers, originally proposed for natural language processing (NLP) tasks, have recently achieved great success in automatic speech recognition (ASR). However, adjacent acoustic units (i.e., frames) are highly correlated, and long-distance dependencies between them are weak, unlike text units. This suggests that ASR will likely benefit from sparse and localized attention. In this paper, we propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities. We demonstrate that WAS leads to consistent Word Error Rate (WER) improvements over strong transformer baselines. On the widely used LibriSpeech benchmark, our proposed method reduces WER by 10% on test-clean and 5% on test-other for streamable transformers, resulting in a new state of the art among streaming models. Further analysis shows that WAS learns to suppress attention to non-critical and redundant continuous acoustic frames, and is more likely to suppress past frames rather than future ones. This indicates the importance of lookahead in attention-based ASR models.
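To make the idea of "dynamically inducing sparsity in attention probabilities" concrete, below is a minimal PyTorch sketch. It zeroes out attention weights that fall below a per-query dynamic threshold and renormalizes the rest. The specific threshold rule (per-query mean minus `gamma` times the standard deviation), the `gamma` value, and the function name `weak_attention_suppression` are assumptions for illustration, not necessarily the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F


def weak_attention_suppression(attn_probs, gamma=0.5, eps=1e-8):
    """Illustrative sketch: sparsify attention probabilities dynamically.

    attn_probs: tensor of shape (..., num_queries, num_keys), rows sum to 1.
    gamma: hypothetical scaling factor for the dynamic threshold (assumption).
    """
    # Per-query statistics over the key dimension.
    mean = attn_probs.mean(dim=-1, keepdim=True)
    std = attn_probs.std(dim=-1, keepdim=True)

    # Dynamic threshold: probabilities below it are treated as "weak" attention.
    threshold = mean - gamma * std

    # Zero out weak attention weights, then renormalize each row to sum to 1.
    suppressed = torch.where(attn_probs < threshold,
                             torch.zeros_like(attn_probs),
                             attn_probs)
    return suppressed / (suppressed.sum(dim=-1, keepdim=True) + eps)


# Usage: apply after the softmax inside a self-attention layer.
probs = F.softmax(torch.randn(2, 4, 50, 50), dim=-1)  # (batch, heads, frames, frames)
sparse_probs = weak_attention_suppression(probs, gamma=0.5)
```

Because the threshold is computed per query from the attention distribution itself, the amount of suppression adapts to each frame rather than being a fixed top-k or fixed cutoff, which matches the "dynamic" sparsity described in the abstract.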