Paper Title

Rule-embedded network for audio-visual voice activity detection in live musical video streams

Paper Authors

Yuanbo Hou, Yi Deng, Bilei Zhu, Zejun Ma, Dick Botteldooren

Paper Abstract

Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus effectively on the target voice in noisy environments. With the help of visual information, this paper proposes a rule-embedded network that fuses audio-visual (A-V) inputs to help the model better detect the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and to use visual representations as a mask that filters out information from non-target sounds. Experiments show that: 1) with the cross-modal fusion guided by the proposed rule, the detection results of the A-V branch outperform those of the audio branch; 2) the bi-modal model far outperforms audio-only models, indicating that combining audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
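
As a rough illustration of the masking rule described in the abstract, below is a minimal PyTorch sketch: the visual stream produces a per-frame soft mask in (0, 1) that gates the audio features before the audio-visual classification head. The GRU encoders, layer sizes, and sigmoid gate are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the paper's masking rule: visual features are
# turned into a soft mask that filters non-target sound out of the audio
# features. Layer sizes and the sigmoid gate are assumptions.
class RuleEmbeddedFusion(nn.Module):
    def __init__(self, audio_dim: int = 128, visual_dim: int = 128):
        super().__init__()
        self.audio_encoder = nn.GRU(audio_dim, 128, batch_first=True)
        self.visual_encoder = nn.GRU(visual_dim, 128, batch_first=True)
        # Visual representation -> per-frame soft mask over audio features.
        self.mask_head = nn.Sequential(nn.Linear(128, 128), nn.Sigmoid())
        self.audio_head = nn.Linear(128, 1)  # audio-only VAD branch
        self.av_head = nn.Linear(128, 1)     # audio-visual (A-V) VAD branch

    def forward(self, audio, visual):
        a, _ = self.audio_encoder(audio)    # (batch, frames, 128)
        v, _ = self.visual_encoder(visual)  # (batch, frames, 128)
        mask = self.mask_head(v)            # values in (0, 1)
        gated = a * mask                    # the "rule": visual mask gates audio
        audio_logits = self.audio_head(a).squeeze(-1)
        av_logits = self.av_head(gated).squeeze(-1)
        return audio_logits, av_logits      # frame-level VAD logits per branch

# Usage with synthetic per-frame features.
audio = torch.randn(2, 100, 128)   # 2 clips, 100 frames each
visual = torch.randn(2, 100, 128)
a_out, av_out = RuleEmbeddedFusion()(audio, visual)
print(a_out.shape, av_out.shape)   # torch.Size([2, 100]) for each branch
```

Returning both branches' logits mirrors the abstract's comparison between the audio branch and the A-V branch; how each branch is supervised is not specified in the abstract and is left out of the sketch.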
