Paper Title
Audio-Visual Wake Word Spotting System For MISP Challenge 2021
Paper Authors
Paper Abstract
This paper presents the details of our system designed for Task 1 of the Multimodal Information Based Speech Processing (MISP) Challenge 2021. The purpose of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, firstly, we take advantage of speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to process the multi-microphone conversational audio. Secondly, several data augmentation techniques are applied to simulate a more realistic far-field scenario. For the video information, the provided region of interest (ROI) is used to obtain the visual representation. Then a multi-layer CNN is proposed to learn audio and visual representations, and these representations are fed into our two-branch attention-based network, in which attention modules such as the transformer and conformer can be employed for fusion. The focal loss is used to fine-tune the model and significantly improves the performance. Finally, multiple trained models are integrated by voting to achieve our final score of 0.091.
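For reference, the sketch below shows a minimal binary focal loss in PyTorch of the kind mentioned in the abstract for fine-tuning. It is not the authors' implementation; the function name, the gamma and alpha defaults, and the dummy usage example are illustrative assumptions, since the abstract does not specify these details.

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss for a binary wake-word classifier.

    logits:  raw model outputs, shape (batch,)
    targets: 0/1 labels as floats, shape (batch,)
    gamma, alpha: commonly used defaults, not values from the paper.
    """
    # Per-sample binary cross-entropy, left unreduced so it can be re-weighted.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    probs = torch.sigmoid(logits)
    # p_t: probability the model assigns to the true class of each sample.
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    # alpha_t: class-balancing weight for positive vs. negative samples.
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    # Down-weight easy examples by the modulating factor (1 - p_t)^gamma.
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()

# Example usage on dummy data:
if __name__ == "__main__":
    logits = torch.randn(8)
    targets = torch.randint(0, 2, (8,)).float()
    print(binary_focal_loss(logits, targets).item())

The modulating factor (1 - p_t)^gamma reduces the contribution of well-classified examples, which is useful when negative (non-wake-word) segments heavily outnumber positives.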