Paper Title
Multimodal Target Speech Separation with Voice and Face References
Paper Authors
Paper Abstract
Target speech separation refers to isolating target speech from a multi-speaker mixture signal by conditioning on auxiliary information about the target speaker. Unlike mainstream audio-visual approaches, which usually require a simultaneous visual stream as additional input (e.g., the corresponding lip movement sequence), our approach proposes the novel use of a single face profile of the target speaker to separate the intended clean speech. We exploit the fact that the image of a face contains information about the person's speech sound. Compared to a simultaneous visual sequence, a face image is easier to obtain through pre-enrollment or from websites, which enables the system to generalize to devices without cameras. To this end, we incorporate face embeddings extracted from a pretrained face recognition model into the speech separation system, where they guide the prediction of a target speaker mask in the time-frequency domain. The experimental results show that a pre-enrolled face image benefits the separation of the expected speech signal. Additionally, face information is complementary to the voice reference, and we show that further improvement can be achieved when combining both face and voice embeddings.
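To make the described conditioning concrete, the following is a minimal sketch of a time-frequency mask estimator conditioned on a reference embedding. The LSTM backbone, layer sizes, and concatenation-based fusion of face and voice embeddings are illustrative assumptions for this sketch, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Sketch: predict a T-F mask for the target speaker, conditioned on a
    reference embedding (face, voice, or a fusion of both)."""
    def __init__(self, n_freq=257, emb_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq + emb_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_freq)

    def forward(self, mix_mag, ref_emb):
        # mix_mag: (B, T, F) magnitude spectrogram of the mixture
        # ref_emb: (B, emb_dim) embedding of the target speaker
        cond = ref_emb.unsqueeze(1).expand(-1, mix_mag.size(1), -1)
        x = torch.cat([mix_mag, cond], dim=-1)   # broadcast embedding over time
        h, _ = self.rnn(x)
        mask = torch.sigmoid(self.out(h))        # (B, T, F) mask in [0, 1]
        return mask * mix_mag                    # estimated target magnitude

# Hypothetical usage: fuse face and voice embeddings by simple concatenation.
face_emb = torch.randn(1, 256)   # e.g., from a pretrained face recognition model
voice_emb = torch.randn(1, 256)  # e.g., from a pre-enrolled voice reference
ref = torch.cat([face_emb, voice_emb], dim=-1)
model = MaskEstimator(n_freq=257, emb_dim=512)
est_mag = model(torch.randn(1, 100, 257), ref)   # (1, 100, 257)
```

Concatenation is only one way to combine the two references; the key point from the abstract is that the face embedding (alone or together with a voice embedding) conditions the mask prediction in the time-frequency domain.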