Paper Title

Acoustic Scene Classification with Spectrogram Processing Strategies

Authors

Helin Wang, Yuexian Zou, Dading Chong

Abstract

Recently, convolutional neural networks (CNNs) have achieved state-of-the-art performance on the acoustic scene classification (ASC) task. The audio data is typically transformed into two-dimensional spectrogram representations, which are then fed to the neural networks. In this paper, we study the problem of efficiently taking advantage of different spectrogram representations through discriminative processing strategies. There are two main contributions. The first is exploring the impact of combining multiple spectrogram representations at different stages, which provides a meaningful reference for effective spectrogram fusion. The second is proposing processing strategies over multiple frequency bands and multiple temporal frames to make full use of a single spectrogram representation. The proposed spectrogram processing strategies can be easily transferred to any network structure. Experiments are carried out on the DCASE 2020 Task1 datasets, and the results show that our method achieves accuracies of 81.8% (official baseline: 54.1%) and 92.1% (official baseline: 87.3%) on the officially provided fold 1 evaluation datasets of Task1A and Task1B, respectively.
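The multi-band and multi-frame idea described in the abstract can be sketched as splitting a single spectrogram along its frequency and time axes before passing the pieces to separate classifier branches. The function name, band/frame counts, and spectrogram shape below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def split_spectrogram(spec, n_bands=2, n_frames=2):
    """Split a (freq, time) spectrogram into frequency sub-bands and
    temporal chunks.

    This is a minimal sketch of multi-band / multi-frame processing:
    each returned piece would be fed to its own network branch.
    """
    # Sub-bands: cut along the frequency axis (rows).
    bands = np.array_split(spec, n_bands, axis=0)
    # Temporal chunks: cut along the time axis (columns).
    frames = np.array_split(spec, n_frames, axis=1)
    return bands, frames

# Example with a hypothetical 128-mel x 431-frame log-mel spectrogram.
spec = np.random.randn(128, 431)
bands, frames = split_spectrogram(spec, n_bands=2, n_frames=2)
print([b.shape for b in bands])   # two (64, 431) sub-bands
print([f.shape for f in frames])  # chunks of roughly half the frames each
```

In this sketch the per-piece predictions would then be fused (e.g. averaged) to form the final scene decision, mirroring the fusion-stage comparison the paper describes.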
