Voice activity detection in the wild via weakly supervised sound event detection

Paper title

Voice activity detection in the wild via weakly supervised sound event detection

Authors

Heinrich Dinkel, Yefei Chen, Mengyue Wu, Kai Yu

Abstract

Traditional supervised voice activity detection (VAD) methods work well in clean and controlled scenarios, but their performance degrades severely in real-world applications. One possible bottleneck is that speech in the wild contains unpredictable noise types, which makes it difficult to obtain the frame-level labels that traditional supervised VAD training requires. In contrast, we propose a general-purpose VAD (GPVAD) framework, which can be easily trained from noisy data in a weakly supervised fashion, requiring only clip-level labels. We propose two GPVAD models: a full model (GPV-F), trained on 527 AudioSet sound events, and a binary model (GPV-B), which only distinguishes speech from noise. We evaluate the two GPV models against a CRNN-based standard VAD model (VAD-C) under three evaluation protocols (clean, synthetic noise, real data). Results show that the proposed GPV-F is competitive with the traditional VAD-C in the clean and synthetic scenarios. Further, in the real-world evaluation, GPV-F largely outperforms VAD-C on both frame-level and segment-level evaluation metrics. With a much lower requirement for frame-labeled data, the naive binary clip-level GPV-B model still achieves performance comparable to VAD-C in real-world scenarios.
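The weakly supervised setup described in the abstract can be illustrated with a minimal NumPy sketch. It assumes that a model emits per-frame class probabilities, that these are pooled into a single clip-level prediction which is compared against the clip-level label, and that frame-level speech decisions are read off the per-frame probabilities at inference time. The abstract does not specify the pooling function, loss, or threshold; linear softmax pooling, binary cross-entropy, the 0.5 threshold, and all function names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def linear_softmax_pool(frame_probs):
    """Pool frame-level probabilities of shape (T, C) into clip-level ones (C,).

    Assumed pooling choice: each frame is weighted by its own probability,
    so confident frames dominate the clip-level estimate.
    """
    frame_probs = np.asarray(frame_probs, dtype=float)
    denom = np.maximum(frame_probs.sum(axis=0), 1e-7)  # avoid division by zero
    return (frame_probs ** 2).sum(axis=0) / denom


def clip_bce(clip_probs, clip_labels):
    """Binary cross-entropy between clip-level predictions and clip-level labels."""
    p = np.clip(np.asarray(clip_probs, dtype=float), 1e-7, 1 - 1e-7)
    y = np.asarray(clip_labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def speech_segments(speech_probs, threshold=0.5):
    """Turn per-frame speech probabilities into (start, end) frame spans.

    `end` is exclusive. The threshold is a hypothetical default.
    """
    active = np.asarray(speech_probs) > threshold
    padded = np.concatenate(([False], active, [False]))
    # Indices where the active/inactive state flips: starts and ends alternate.
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[::2], changes[1::2])]


# Toy clip: 3 frames, 2 classes (column 0 = speech, column 1 = noise).
frame_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.1]])
clip_probs = linear_softmax_pool(frame_probs)
loss = clip_bce(clip_probs, [1.0, 0.0])  # only the clip-level label is needed
segments = speech_segments(frame_probs[:, 0])
```

Only `clip_bce` sees a label, and that label describes the whole clip. The frame-level output used by `speech_segments` is never supervised directly, which is why clip-level annotation is enough for training in this kind of setup.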
