Paper Title

S3T: Self-Supervised Pre-training with Swin Transformer for Music Classification

Authors

Hang Zhao, Chen Zhang, Belei Zhu, Zejun Ma, Kejun Zhang

Abstract

In this paper, we propose S3T, a self-supervised pre-training method with Swin Transformer for music classification, aiming to learn meaningful music representations from massive, easily accessible unlabeled music data. S3T introduces the momentum-based paradigm MoCo, with a Swin Transformer as its feature extractor, to the music time-frequency domain. For better music representation learning, S3T contributes a music data augmentation pipeline and two specially designed pre-processors. To our knowledge, S3T is the first method to combine the Swin Transformer with a self-supervised learning method for music classification. We evaluate S3T on music genre classification and music tagging tasks with linear classifiers trained on the learned representations. Experimental results show that S3T outperforms the previous self-supervised method (CLMR) by 12.5 percentage points in top-1 accuracy and 4.8 percentage points in PR-AUC on the two tasks, respectively, and also surpasses the task-specific state-of-the-art supervised methods. In addition, S3T shows advances in label efficiency: using only 10% of the labeled data, it exceeds CLMR trained with 100% of the labeled data on both tasks.
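The abstract describes the pre-training recipe only at a high level. As a rough illustration of the momentum-based (MoCo-style) paradigm it refers to, the sketch below pairs a Swin Transformer backbone with an in-batch InfoNCE loss over two augmented views of log-mel spectrogram patches. Everything concrete here is an assumption, not the paper's configuration: the backbone is timm's swin_tiny, the momentum and temperature values are typical MoCo defaults, and the paper's augmentation pipeline, pre-processors, and negative-sample queue are not reproduced.

```python
# Minimal sketch of a MoCo-style contrastive step with a Swin backbone on
# spectrogram "images". All hyperparameters and the choice of swin_tiny are
# assumptions for illustration, not the S3T paper's actual setup.
import copy
import torch
import torch.nn.functional as F

def build_encoder(embed_dim=128):
    """Swin-style encoder over 1-channel log-mel patches (assumed 224x224)."""
    import timm  # pip install timm
    backbone = timm.create_model(
        "swin_tiny_patch4_window7_224", pretrained=False,
        in_chans=1, num_classes=0)  # num_classes=0 -> pooled features
    return torch.nn.Sequential(
        backbone, torch.nn.Linear(backbone.num_features, embed_dim))

query_enc = build_encoder()
key_enc = copy.deepcopy(query_enc)   # momentum (key) encoder
for p in key_enc.parameters():
    p.requires_grad_(False)

momentum, temperature = 0.999, 0.07  # typical MoCo values (assumed)
optimizer = torch.optim.AdamW(query_enc.parameters(), lr=1e-4)

def moco_step(view_q, view_k):
    """One contrastive step on two augmented views of the same batch."""
    q = F.normalize(query_enc(view_q), dim=1)
    with torch.no_grad():
        # momentum update of the key encoder from the query encoder
        for pk, pq in zip(key_enc.parameters(), query_enc.parameters()):
            pk.mul_(momentum).add_(pq, alpha=1.0 - momentum)
        k = F.normalize(key_enc(view_k), dim=1)
    # in-batch InfoNCE; the real MoCo keeps a large queue of negatives instead
    logits = q @ k.t() / temperature
    labels = torch.arange(q.size(0))
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: two augmented "views" of a batch of 1x224x224 log-mel patches.
views = torch.randn(2, 4, 1, 224, 224)
print(moco_step(views[0], views[1]))
```

After pre-training, the abstract's linear evaluation would correspond to freezing `query_enc` and fitting a linear classifier on its pooled features for genre classification or tagging.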
