Paper Title

RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection

Paper Authors

Dongchao Yang, Helin Wang, Zhongjie Ye, Yuexian Zou, Wenwu Wang

Paper Abstract

Target sound detection (TSD) aims to detect the target sound from a mixture audio given the reference information. Previous methods use a conditional network to extract a sound-discriminative embedding from the reference audio, and then use it to detect the target sound from the mixture audio. However, the network performs much differently when using different reference audios (e.g. performs poorly for noisy and short-duration reference audios), and tends to make wrong decisions for transient events (i.e. shorter than $1$ second). To overcome these problems, in this paper, we present a reference-aware and duration-robust network (RaDur) for TSD. More specifically, in order to make the network more aware of the reference information, we propose an embedding enhancement module to take into account the mixture audio while generating the embedding, and apply the attention pooling to enhance the features of target sound-related frames and weaken the features of noisy frames. In addition, a duration-robust focal loss is proposed to help model different-duration events. To evaluate our method, we build two TSD datasets based on UrbanSound and Audioset. Extensive experiments show the effectiveness of our methods.
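The abstract names two concrete building blocks: attention pooling, used to emphasize target-sound-related frames when deriving the reference embedding, and a focal-loss variant made robust to event duration. The paper's exact formulations are not reproduced here; the sketch below only illustrates the standard versions of these components (a learned attention pool over frame features and the usual binary focal loss). Class names, tensor shapes, and hyperparameters are illustrative assumptions, not the authors' implementation, and the duration-dependent reweighting of RaDur's loss is deliberately left out.

```python
# Minimal sketch (not the authors' code): generic attention pooling over
# reference-audio frames plus a standard binary focal loss on frame-level
# detection logits. Shapes and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPooling(nn.Module):
    """Pool frame-level features (B, T, D) into a clip-level embedding (B, D),
    weighting frames by a learned scalar attention score."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) frame embeddings of the reference audio
        weights = torch.softmax(self.score(frames), dim=1)  # (B, T, 1)
        return (weights * frames).sum(dim=1)                # (B, D)


def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Standard binary focal loss on frame-level logits/targets of shape (B, T).
    A duration-robust variant would additionally reweight frames according to
    event duration, which is not reproduced here."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = targets * p + (1 - targets) * (1 - p)
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()


if __name__ == "__main__":
    pool = AttentionPooling(dim=128)
    ref_frames = torch.randn(4, 250, 128)            # 4 reference clips, 250 frames each
    clip_emb = pool(ref_frames)                      # (4, 128) conditioning embedding
    logits = torch.randn(4, 500)                     # frame-level detection logits
    targets = torch.randint(0, 2, (4, 500)).float()  # frame-level labels
    print(clip_emb.shape, focal_loss(logits, targets).item())
```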
