Title
Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering
Authors
Abstract
We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the first pass. Our baseline is an acoustic model (AM), with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention networks yield better accuracy while requiring fewer parameters. We add an auto-regressive decoder network on top of the self-attention layers and jointly minimize the CTC loss on the encoder and the cross-entropy loss on the decoder. This design yields further improvements over the baseline. We retrain all the models above in a multi-task learning (MTL) setting, where one branch of a shared network is trained as an AM, while the second branch classifies the whole sequence as a true trigger or not. Results demonstrate that networks with self-attention layers yield $\sim$60% relative reduction in false reject rates for a given false-alarm rate, while requiring 10% fewer parameters. When trained in the MTL setup, self-attention networks yield further accuracy improvements. On-device measurements show that we observe 70% relative reduction in inference time. Additionally, the proposed network architectures are $\sim$5X faster to train.
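For reference, the CTC objective mentioned above scores a label sequence by summing over all frame-level alignments (with an interleaved blank symbol). Below is a minimal NumPy sketch of the standard CTC forward (likelihood) computation; the function name and toy inputs are illustrative and not taken from the paper, and a practical implementation would work in log space for numerical stability.

```python
import numpy as np

def ctc_likelihood(probs, target, blank=0):
    """Likelihood of `target` under CTC via the forward recursion.

    probs:  (T, V) array of per-frame label probabilities.
    target: list of label ids (without blanks).
    """
    T = probs.shape[0]
    # Extended label sequence with blanks interleaved: ^ l1 ^ l2 ^ ...
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]      # start with blank
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]  # or start with the first label

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skipping the intermediate blank is allowed only when the
            # current symbol is a label different from the one two back.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]

    # Valid endings: last label or the trailing blank.
    total = alpha[T - 1, S - 1]
    if S > 1:
        total += alpha[T - 1, S - 2]
    return total

# Toy example: vocab = {0: blank, 1: label}, two frames.
probs = np.array([[0.6, 0.4],
                  [0.6, 0.4]])
# Paths emitting "1": (1,1), (1,^), (^,1) -> 0.16 + 0.24 + 0.24 = 0.64
print(ctc_likelihood(probs, [1]))  # 0.64
```

In the hybrid design described in the abstract, a loss of this form on the encoder would be minimized jointly with the decoder's cross-entropy loss; the relative weighting of the two terms is a training hyperparameter not specified in the abstract.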