BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping

Paper title

BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping

Authors

Gasser Elbanna, Neil Scheidwasser-Clow, Mikolaj Kegler, Pierre Beckmann, Karl El Hajal, Milos Cernak

Abstract

Methods for extracting audio and speech features have been studied since pioneering work on spectrum analysis decades ago. Recent efforts are guided by the ambition to develop general-purpose audio representations. For example, deep neural networks can extract optimal embeddings if they are trained on large audio datasets. This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets. Lastly, we present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features. All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks. Our results indicate that the hybrid model with a convolutional transformer as the encoder yields superior performance in most HEAR challenge tasks.
