Paper Title
Language-Based Audio Retrieval with Converging Tied Layers and Contrastive Loss
Paper Authors
Paper Abstract
In this paper, we tackle the new Language-Based Audio Retrieval task proposed in DCASE 2022. Firstly, we introduce a simple, scalable architecture which ties the audio and text encoders together. Secondly, we show that using this architecture together with a contrastive loss allows the model to significantly beat the performance of the baseline model. Finally, in addition to having an extremely low training memory requirement, we are able to use pretrained models as-is, without needing to finetune them. We evaluate our methods and show that combining them beats the baseline scores significantly.
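The abstract describes two core ingredients: a single encoder whose weights are tied across the audio and text branches, and a contrastive objective over paired audio/caption embeddings. The sketch below is an illustrative interpretation of those ideas, not the authors' released code; all module names, dimensions, and hyperparameters (e.g. `d_model`, `temperature`) are assumptions.

```python
# Minimal sketch: a Transformer encoder shared ("tied") between the audio and
# text branches, trained with a symmetric InfoNCE-style contrastive loss.
# Feature dimensions and architecture details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedAudioTextRetrieval(nn.Module):
    def __init__(self, audio_feat_dim=768, text_feat_dim=768, d_model=768, n_layers=2):
        super().__init__()
        # Linear projections map features from frozen pretrained audio/text
        # backbones into a common dimension before the shared encoder.
        self.audio_proj = nn.Linear(audio_feat_dim, d_model)
        self.text_proj = nn.Linear(text_feat_dim, d_model)
        # One Transformer encoder whose weights are tied across both modalities.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def encode(self, feats, proj):
        x = self.shared_encoder(proj(feats))   # (B, T, d_model)
        x = x.mean(dim=1)                      # temporal mean pooling
        return F.normalize(x, dim=-1)          # unit-norm embeddings

    def forward(self, audio_feats, text_feats):
        return (self.encode(audio_feats, self.audio_proj),
                self.encode(text_feats, self.text_proj))


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched audio/caption pairs lie on the diagonal."""
    logits = audio_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Usage with dummy batches of precomputed features from frozen backbones:
# audio features (B, T_audio, 768) and caption features (B, T_text, 768).
model = TiedAudioTextRetrieval()
audio = torch.randn(4, 100, 768)
text = torch.randn(4, 20, 768)
a_emb, t_emb = model(audio, text)
loss = contrastive_loss(a_emb, t_emb)
```

Because only the projections and the small shared encoder are trained while the pretrained backbones stay frozen, the training memory footprint stays low, which is consistent with the claim in the abstract.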