Title
Conditioned Time-Dilated Convolutions for Sound Event Detection
Authors
Abstract
Sound event detection (SED) is the task of identifying sound events along with their onset and offset times. A recent convolutional neural network based SED method proposed the use of depthwise separable (DWS) and time-dilated convolutions. DWS and time-dilated convolutions yielded state-of-the-art results for SED with a considerably smaller number of parameters. In this work we propose an extension of the time-dilated convolutions, conditioning them on jointly learned embeddings of the SED predictions made by the SED classifier. We present a novel algorithm for conditioning the time-dilated convolutions which functions similarly to language modelling and enhances the performance of these convolutions. We employ the freely available TUT-SED Synthetic dataset, and we assess the performance of our method using the average per-frame $\text{F}_{1}$ score and the average per-frame error rate over 10 experiments. We achieve an increase of 2\% (from 0.63 to 0.65) in the average $\text{F}_{1}$ score (the higher the better) and a decrease of 3\% (from 0.50 to 0.47) in the error rate (the lower the better).
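To make the idea concrete, the following is a minimal NumPy sketch of a causal time-dilated 1-D convolution that is conditioned on an embedding of per-frame SED predictions. The abstract does not specify the exact conditioning mechanism, so the concatenation of the prediction embedding onto the input features, as well as all shapes and names (`time_dilated_conv`, `emb`, etc.), are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def time_dilated_conv(x, w, dilation):
    """Causal 1-D convolution along the time axis with the given dilation.
    x: (T, C_in) feature frames, w: (K, C_in, C_out) kernel."""
    K, C_in, C_out = w.shape
    T = x.shape[0]
    pad = dilation * (K - 1)                      # left-pad to keep length T (causal)
    xp = np.concatenate([np.zeros((pad, C_in)), x], axis=0)
    y = np.zeros((T, C_out))
    for t in range(T):
        for k in range(K):
            # Tap k looks back (K - 1 - k) * dilation frames from frame t.
            y[t] += xp[t + k * dilation] @ w[k]
    return y

def conditioned_time_dilated_conv(x, preds, w, emb, dilation):
    """Condition the convolution by embedding the SED predictions and
    concatenating the embedding to the input features (one illustrative
    choice; the paper's exact conditioning scheme is not in the abstract)."""
    e = preds @ emb                               # (T, E) embedding of class activities
    x_cond = np.concatenate([x, e], axis=1)       # (T, C_in + E)
    return time_dilated_conv(x_cond, w, dilation)

rng = np.random.default_rng(0)
T, C, E, n_classes, K, d = 16, 8, 4, 6, 3, 2
x = rng.standard_normal((T, C))                   # input feature frames
preds = rng.random((T, n_classes))                # per-frame SED predictions
emb = rng.standard_normal((n_classes, E))         # jointly learned embedding matrix
w = rng.standard_normal((K, C + E, 10))           # kernel over conditioned features
out = conditioned_time_dilated_conv(x, preds, w, emb, d)
print(out.shape)  # (16, 10)
```

In training, `emb` would be learned jointly with the classifier, and the predictions fed back would come from the classifier itself, giving the language-modelling-like dependence on past event activity mentioned above.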