Improving Voice Trigger Detection with Metric Learning

Paper Title

Improving Voice Trigger Detection with Metric Learning

Authors

Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

Abstract

Voice trigger detection is an important task that activates a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and then used for the voice trigger detection task. However, such a speaker-independent voice trigger detector typically suffers from performance degradation on speech from underrepresented groups, such as accented speakers. In this work, we propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy. Our proposed model employs an encoder-decoder architecture. While the encoder performs speaker-independent voice trigger detection, similar to the conventional detector, the decoder predicts a personalized embedding for each utterance. A personalized voice trigger score is then obtained as a similarity score between the embeddings of the enrollment utterances and that of a test utterance. The personalized embedding allows the model to adapt to the target speaker's speech when computing the voice trigger score, hence improving voice trigger detection accuracy. Experimental results show that the proposed approach achieves a 38% relative reduction in the false rejection rate (FRR) compared to a baseline speaker-independent voice trigger model.
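
The personalized scoring step described in the abstract can be illustrated with a minimal sketch. The abstract only says that the score is a similarity between the enrollment embeddings and the test embedding; the choice of cosine similarity, the averaging over enrollment utterances, and the names `cosine_similarity` and `personalized_score` are assumptions made here for illustration, not details confirmed by the paper. The embeddings are taken as given (plain lists of floats), since the encoder-decoder model itself is not specified.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def personalized_score(
    enrollment_embeddings: Sequence[Sequence[float]],
    test_embedding: Sequence[float],
) -> float:
    """Hypothetical personalized voice trigger score.

    Averages the cosine similarity between the test embedding and each
    enrollment embedding. Averaging is an assumption; the abstract does
    not say how multiple enrollment utterances are combined.
    """
    if not enrollment_embeddings:
        raise ValueError("at least one enrollment embedding is required")
    similarities = [
        cosine_similarity(e, test_embedding) for e in enrollment_embeddings
    ]
    return sum(similarities) / len(similarities)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings, made up purely for illustration.
    enrollment = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
    test = [0.85, 0.15, 0.05]
    print(personalized_score(enrollment, test))
```

In a sketch like this, a higher score means the test utterance lies closer to the target speaker's enrollment utterances in the embedding space; how such a score would be thresholded or combined with the speaker-independent encoder output is not stated in the abstract.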
