Paper Title
FSD50K: An Open Dataset of Human-Labeled Sound Events
Paper Authors
Paper Abstract
Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on over 2M tracks from YouTube videos and encompassing over 500 sound classes. However, AudioSet is not an open dataset as its official release consists of pre-computed audio features. Downloading the original audio tracks can be problematic due to YouTube videos gradually disappearing and usage rights issues. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.