Paper Title
Transfer Learning for Improving Singing-voice Detection in Polyphonic Instrumental Music
Paper Authors
Paper Abstract
Detecting the singing voice in polyphonic instrumental music is critical to music information retrieval. Training a robust vocal detector requires a large dataset labeled at the frame level as vocal or non-vocal. However, frame-level labeling is time-consuming and labor-intensive, so few well-labeled datasets are available for singing-voice detection (S-VD). Hence, we propose a data augmentation method for S-VD based on transfer learning. In this study, clean speech clips with voice-activity endpoints and separate instrumental music clips are artificially mixed to simulate polyphonic vocals and to train a vocal/non-vocal detector. Because articulation and phonation differ between speaking and singing, a detector trained on this artificial dataset does not match polyphonic music well, i.e., singing vocals mixed with instrumental accompaniment. To reduce this mismatch, transfer learning is used to transfer the knowledge learned from the artificial speech-plus-music training set to a small but matched polyphonic dataset, i.e., singing vocals with accompaniment. By transferring related knowledge to compensate for the lack of well-labeled training data in S-VD, the proposed data augmentation method with transfer learning improves S-VD performance, raising the F-score from 89.5% to 93.2%.
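To make the two stages described above concrete, the following is a minimal PyTorch sketch of (1) mixing clean speech with instrumental music to simulate polyphonic audio and (2) fine-tuning the pre-trained detector on a small matched set of singing vocals with accompaniment. All module names, architecture choices, and hyper-parameters are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch only: the model, mixing ratio, and training loop
# are assumptions standing in for the paper's unspecified details.
import numpy as np
import torch
import torch.nn as nn

def mix_speech_and_music(speech, music, snr_db=0.0):
    """Additively mix a clean speech clip with an instrumental clip at a
    given speech-to-music ratio (in dB) to simulate polyphonic vocals."""
    n = min(len(speech), len(music))
    speech, music = speech[:n], music[:n]
    speech_pow = np.mean(speech ** 2) + 1e-12
    music_pow = np.mean(music ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (music_pow * 10 ** (snr_db / 10)))
    return speech + gain * music

class VocalDetector(nn.Module):
    """Frame-level vocal/non-vocal classifier over spectrogram frames."""
    def __init__(self, n_features=80, n_hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_features, n_hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * n_hidden, 2)  # vocal vs. non-vocal

    def forward(self, x):        # x: (batch, frames, n_features)
        h, _ = self.rnn(x)
        return self.head(h)      # per-frame logits

def fine_tune(model, polyphonic_loader, lr=1e-4, epochs=5):
    """Transfer-learning step: start from the speech-plus-music pre-trained
    weights and fine-tune on the small matched polyphonic dataset."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for features, frame_labels in polyphonic_loader:
            logits = model(features)
            loss = criterion(logits.reshape(-1, 2), frame_labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

In this sketch the whole network is fine-tuned with a reduced learning rate; freezing the lower layers and adapting only the classification head is an equally plausible reading of the transfer-learning step, since the abstract does not specify which layers are updated.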