Paper Title

Caption Feature Space Regularization for Audio Captioning

Authors

Zhang, Yiming, Yu, Hong, Du, Ruoyi, Ma, Zhanyu, Dong, Yuan

Abstract

Audio captioning aims at describing the content of audio clips in human language. Due to the ambiguity of audio, different people may perceive the same audio differently, resulting in caption disparities (i.e., one audio may correlate to several captions with diverse semantics). To handle this, general audio captioning models perform one-to-many training by randomly selecting a correlated caption as the ground truth for each audio. However, this leads to significant variation in the optimization directions and weakens model stability. To eliminate this negative effect, in this paper we propose a two-stage framework for audio captioning: (i) in the first stage, via contrastive learning, we construct a proxy feature space to reduce the distances between captions correlated to the same audio, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to encourage the model to be optimized in a direction that benefits all the correlated captions. We conducted extensive experiments on two datasets using four commonly used encoder and decoder architectures. Experimental results demonstrate the effectiveness of the proposed method. The code is available at https://github.com/PRIS-CV/Caption-Feature-Space-Regularization.
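The first-stage idea above, pulling caption embeddings of the same audio together via contrastive learning, can be sketched as a minimal InfoNCE-style loss. This is an illustrative NumPy sketch, not the authors' implementation; the function name, temperature value, and toy embeddings are hypothetical assumptions:

```python
import numpy as np

def caption_contrastive_loss(caption_embs, audio_ids, temperature=0.1):
    """InfoNCE-style loss over caption embeddings (hypothetical sketch).

    Captions sharing an audio_id are treated as positive pairs; all other
    captions in the batch act as negatives. Lower loss means captions of
    the same audio sit closer together in the feature space.
    """
    # L2-normalize embeddings so dot products are cosine similarities.
    z = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    n = len(audio_ids)
    losses = []
    for i in range(n):
        others = np.array([j != i for j in range(n)])
        positives = np.array(
            [j != i and audio_ids[j] == audio_ids[i] for j in range(n)]
        )
        if not positives.any():
            continue  # caption with no positive partner contributes nothing
        # log of the softmax denominator over all other captions
        log_denom = np.log(np.exp(sim[i][others]).sum())
        # average negative log-likelihood over the positive captions
        losses.append(float(np.mean(-(sim[i][positives] - log_denom))))
    return float(np.mean(losses))
```

For example, a batch where the two captions of each audio are nearly identical embeddings yields a much smaller loss than a batch where they point in orthogonal directions, which is the behavior the regularization stage relies on.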
