Contrastive Unsupervised Learning for Audio Fingerprinting

Paper Title

Contrastive Unsupervised Learning for Audio Fingerprinting

Authors

Zhesong Yu, Xingjian Du, Bilei Zhu, Zejun Ma

Abstract

The rise of video-sharing platforms has led more and more people to shoot videos and upload them to the Internet. These videos usually contain a carefully edited background audio track that may involve severe speed change, pitch shifting, and various audio effects, so existing audio identification systems may fail to recognize the audio. To address this problem, we introduce contrastive learning to the task of audio fingerprinting (AFP). Contrastive learning is an unsupervised approach to learning representations that group similar samples together and separate dissimilar ones. In our work, we treat an audio track and its differently distorted versions as similar, and different audio tracks as dissimilar. Building on the momentum contrast (MoCo) framework, we devise a contrastive learning method for AFP that generates fingerprints that are both discriminative and robust. Experiments show that our AFP method is effective for audio identification and robust to serious audio distortions, including the challenging cases of speed change and pitch shifting.
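The contrastive objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard InfoNCE loss and momentum update used in MoCo-style training. The function names (`info_nce`, `momentum_update`), the temperature of 0.07, and the momentum of 0.999 are illustrative choices that do not come from this abstract. Embeddings are plain Python lists standing in for encoder outputs.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query embedding.

    query:     embedding of an audio segment
    positive:  embedding of a distorted version of the same track
    negatives: embeddings of segments from other tracks
    """
    q = normalize(query)
    # The positive logit goes first, followed by one logit per negative.
    logits = [dot(q, normalize(positive)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    # Cross-entropy with the positive as the target class.
    return log_sum - logits[0]

def momentum_update(key_params, query_params, m=0.999):
    """Move key-encoder parameters slowly toward the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

# A query that matches its positive gives a loss near zero.
aligned = info_nce([1, 0, 0], [1, 0, 0], [[0, 1, 0], [0, 0, 1]])
# A query that matches a negative instead gives a large loss.
confused = info_nce([1, 0, 0], [0, 1, 0], [[1, 0, 0], [0, 0, 1]])
print(aligned < confused)
```

Minimizing this loss pulls a track and its distorted versions together in the embedding space while pushing other tracks away, which is the behavior the abstract calls discriminative and robust.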
