Paper title
Voice activity detection in the wild via weakly supervised sound event detection
Paper authors
Paper abstract
Traditional supervised voice activity detection (VAD) methods work well in clean and controlled scenarios, but their performance degrades severely in real-world applications. One possible bottleneck is that speech in the wild contains unpredictable noise types, which makes it difficult to predict the frame-level labels that traditional supervised VAD training requires. In contrast, we propose a general-purpose VAD (GPVAD) framework, which can be easily trained from noisy data in a weakly supervised fashion, requiring only clip-level labels. We propose two GPVAD models: a full model (GPV-F), trained on 527 Audioset sound events, and a binary model (GPV-B), which only distinguishes speech from noise. We evaluate the two GPV models against a CRNN-based standard VAD model (VAD-C) under three evaluation protocols (clean, synthetic noise, real data). Results show that GPV-F is competitive with the traditional VAD-C in the clean and synthetic scenarios. Further, in real-world evaluation, GPV-F largely outperforms VAD-C on both frame-level and segment-level evaluation metrics. Despite a much lower requirement for frame-labeled data, the naive binary clip-level GPV-B model still achieves performance comparable to VAD-C in real-world scenarios.
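The weakly supervised setup described in the abstract can be illustrated with a minimal sketch: a model emits per-frame speech probabilities, these are pooled into a single clip-level probability, and the loss is computed against the clip-level label only. The abstract does not state which pooling function or loss is used, so the linear softmax pooling and binary cross-entropy below are assumptions, and all function names and the 0.5 threshold are hypothetical choices made for illustration.

```python
import math


def linear_softmax_pool(frame_probs):
    """Pool per-frame probabilities into one clip-level probability.

    Assumption: linear softmax pooling, sum(p^2) / sum(p). The abstract
    does not name the pooling function; this is one common choice.
    """
    total = sum(frame_probs)
    if total == 0.0:
        return 0.0
    return sum(p * p for p in frame_probs) / total


def clip_bce(clip_prob, clip_label, eps=1e-7):
    """Binary cross-entropy between a clip probability and a 0/1 clip label.

    Only the clip-level label is needed, so no frame labels enter the loss.
    """
    p = min(max(clip_prob, eps), 1.0 - eps)
    return -(clip_label * math.log(p) + (1 - clip_label) * math.log(1.0 - p))


def frames_to_segments(frame_probs, threshold=0.5):
    """Threshold frame probabilities and merge consecutive speech frames.

    Returns a list of (start_frame, end_frame_exclusive) tuples.
    The threshold value is a hypothetical default.
    """
    segments = []
    start = None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i  # a speech segment begins here
        elif p < threshold and start is not None:
            segments.append((start, i))  # the segment ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(frame_probs)))
    return segments


# Hypothetical per-frame speech probabilities for one clip labeled "speech".
frame_probs = [0.1, 0.9, 0.8, 0.2, 0.7]
clip_prob = linear_softmax_pool(frame_probs)
loss = clip_bce(clip_prob, clip_label=1)
segments = frames_to_segments(frame_probs)
```

During training only `clip_bce` would see a label, while at inference the frame-level probabilities are thresholded directly, which is how a model trained on clip-level labels can still produce frame-level and segment-level VAD output.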