Paper Title

A Mixed Supervised Learning Framework for Target Sound Detection

Authors

Dongchao Yang, Helin Wang, Yuexian Zou, Wenwu Wang

Abstract

Target sound detection (TSD) aims to detect a target sound from mixture audio given the reference information. Previous works have shown that TSD models can be trained on fully-annotated (frame-level label) or weakly-annotated (clip-level label) data. However, there is clear evidence that a model trained on weakly-annotated data performs worse than one trained on fully-annotated data. To fill this gap, we provide a mixed supervision perspective, in which novel categories (the target domain) are learned using weak annotations with the help of full annotations of existing base categories (the source domain). To realize this, a mixed supervised learning framework is proposed, which contains two mutually-helping student models (\textit{f\_student} and \textit{w\_student}) that learn from fully-annotated and weakly-annotated data, respectively. The motivation is that \textit{f\_student}, trained on fully-annotated data, has a better ability to capture detailed information than \textit{w\_student}. Thus, we first let \textit{f\_student} guide \textit{w\_student} to learn to capture details, so that \textit{w\_student} can perform better in the target domain. Then we let \textit{w\_student} guide \textit{f\_student} to fine-tune on the target domain. This process can be repeated several times so that both students perform well in the target domain. To evaluate our method, we built three TSD datasets based on UrbanSound and AudioSet. Experimental results show that our method offers about an 8\% improvement in event-based F-score compared with a recent baseline.
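The abstract describes an alternating mutual-guidance procedure: pre-train \textit{f\_student} on fully-annotated source-domain data, let it provide frame-level guidance to \textit{w\_student} on the weakly-annotated target domain, then let \textit{w\_student} guide the fine-tuning of \textit{f\_student}, and repeat. The sketch below is not the authors' implementation; the toy models, mean pooling, loss choices, and the distillation weight `alpha` are placeholder assumptions, included only to make the cycle concrete.

```python
# Minimal sketch (assumed details, not the paper's code) of the alternating
# f_student <-> w_student guidance loop described in the abstract.
import torch
import torch.nn as nn

T, F_DIM, BATCH = 100, 64, 8  # frames per clip, feature dim, batch size

def make_student():
    # Frame-wise detector: per-frame probability that the target sound is active.
    return nn.Sequential(nn.Linear(F_DIM, 128), nn.ReLU(),
                         nn.Linear(128, 1), nn.Sigmoid())

f_student, w_student = make_student(), make_student()
bce = nn.BCELoss()
opt_f = torch.optim.Adam(f_student.parameters(), lr=1e-3)
opt_w = torch.optim.Adam(w_student.parameters(), lr=1e-3)

# Dummy tensors standing in for real features and labels.
src_x = torch.randn(BATCH, T, F_DIM)                       # source domain (base categories)
src_frame_y = torch.randint(0, 2, (BATCH, T, 1)).float()   # frame-level labels
tgt_x = torch.randn(BATCH, T, F_DIM)                       # target domain (novel categories)
tgt_clip_y = torch.randint(0, 2, (BATCH, 1)).float()       # clip-level labels
alpha = 0.5                                                 # assumed distillation weight

# Step 0: train f_student on fully-annotated source-domain data.
for _ in range(10):
    opt_f.zero_grad()
    bce(f_student(src_x), src_frame_y).backward()
    opt_f.step()

for _ in range(3):  # repeat the mutual guidance several times
    # Step 1: f_student guides w_student -- w_student fits the clip-level labels
    # while being distilled toward f_student's frame-level predictions.
    for _ in range(10):
        opt_w.zero_grad()
        frame_pred = w_student(tgt_x)                 # (B, T, 1)
        clip_pred = frame_pred.mean(dim=1)            # simple pooling to clip level
        with torch.no_grad():
            pseudo_frames = f_student(tgt_x)          # teacher's frame-level guess
        loss = bce(clip_pred, tgt_clip_y) + alpha * bce(frame_pred, pseudo_frames)
        loss.backward()
        opt_w.step()

    # Step 2: w_student guides f_student to fine-tune on the target domain,
    # using w_student's frame-level predictions as pseudo labels.
    for _ in range(10):
        opt_f.zero_grad()
        with torch.no_grad():
            pseudo_frames = w_student(tgt_x)
        bce(f_student(tgt_x), pseudo_frames).backward()
        opt_f.step()
```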
