CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-Transformer

Paper title

CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-Transformer

Authors

Zhanheng Yang, Sining Sun, Jin Li, Xiaoming Zhang, Xiong Wang, Long Ma, Lei Xie

Abstract

Customized keyword spotting (KWS) has great potential for deployment on edge devices to provide a hands-free user experience. In real applications, however, false alarms (FAs) become a serious problem when spotting dozens or even hundreds of keywords, and they drastically degrade the user experience. To address this problem, we leverage recent advances in transducer- and transformer-based acoustic models and propose a new multi-stage customized KWS framework named Cascaded Transducer-Transformer KWS (CaTT-KWS). It consists of a transducer-based keyword detector, a forced-alignment module built on a frame-level phone predictor, and a transformer-based decoder. Specifically, the streaming transducer module spots keyword candidates in the audio stream. Forced alignment is then performed on the phone posteriors produced by the phone predictor, which completes the first stage of keyword verification and refines the time boundaries of the keyword. Finally, the transformer decoder further verifies the triggered keyword. The proposed CaTT-KWS framework reduces the FA rate effectively without noticeably hurting keyword recognition accuracy. On a challenging dataset it reaches 0.13 FA per hour, a relative FA reduction of over 90% compared with the transducer-based detection model, while keyword recognition accuracy drops by less than 2%.
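The cascade described in the abstract can be pictured as three gates applied in sequence, where a candidate is rejected as soon as any stage fails. The following is only a minimal sketch of that control flow, not the authors' implementation: the names `Candidate`, `run_cascade`, `detect`, `align`, and `verify`, as well as the use of score thresholds at each verification stage, are assumptions made for illustration. The abstract does not specify how each stage makes its accept/reject decision.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Candidate:
    """A keyword hypothesis with frame boundaries (hypothetical structure)."""
    keyword: str
    start: int  # first frame index, inclusive
    end: int    # last frame index, exclusive


Frames = Sequence[Sequence[float]]


def run_cascade(
    frames: Frames,
    detect: Callable[[Frames], Optional[Candidate]],
    align: Callable[[Frames, Candidate], Tuple[float, Candidate]],
    verify: Callable[[Frames, str], float],
    align_threshold: float,
    verify_threshold: float,
) -> Optional[Candidate]:
    """Return the accepted keyword candidate, or None if any stage rejects it.

    detect: stands in for the streaming transducer detector.
    align:  stands in for forced alignment on phone posteriors; it returns
            a score and a candidate with refined time boundaries.
    verify: stands in for the transformer decoder; it scores the refined
            segment against the keyword.
    Both thresholds are assumed parameters, not values from the paper.
    """
    candidate = detect(frames)
    if candidate is None:
        return None  # nothing spotted in the stream

    align_score, refined = align(frames, candidate)
    if align_score < align_threshold:
        return None  # rejected by first-stage verification

    segment = frames[refined.start:refined.end]
    if verify(segment, refined.keyword) < verify_threshold:
        return None  # rejected by second-stage verification

    return refined
```

Because the later stages only run on candidates that survive the earlier ones, the more expensive verification is invoked rarely, which is one plausible reason a cascade of this shape suits edge devices.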
