Paper title
Listen carefully and tell: an audio captioning system based on residual learning and gammatone audio representation
Authors
Abstract
Automated audio captioning is a machine listening task whose goal is to describe an audio signal using free text. An automated audio captioning system accepts an audio signal as input and outputs a textual description of it, that is, the caption of the signal. This task can be useful in many applications, such as automatic content description or machine-to-machine interaction. In this work, an automated audio captioning system based on residual learning in the encoder stage is proposed. The encoder stage is implemented with different Residual Network (ResNet) configurations. The decoder stage, which generates the caption, uses recurrent layers combined with an attention mechanism. A gammatone audio representation is used as the system input. Results show that the proposed framework outperforms the baseline system in the challenge results.
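The abstract names the gammatone representation but gives no parameters for it. As a rough illustration of what such a time-frequency representation involves, here is a minimal pure-Python sketch, assuming a 4th-order gammatone filterbank with ERB bandwidths from Glasberg and Moore (bandwidth factor b = 1.019), centre frequencies spaced uniformly on the ERB-rate scale, and log frame energies. The helper names (`erb`, `erb_space`, `gammatone_ir`, `fir_filter`, `gammatone_spectrogram`), the 8 channels, the 200 to 3000 Hz range, the 50 ms filter length and the 25 ms / 10 ms framing are hypothetical choices for illustration, not the authors' configuration.

```python
import math

# Illustrative sketch only: all parameter values are assumptions,
# not the configuration used in the paper.


def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore, 1990)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def hz_to_erb_rate(f):
    """Frequency in Hz -> ERB-rate (ERB-number) scale."""
    return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)


def erb_rate_to_hz(e):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37


def erb_space(low, high, n):
    """n centre frequencies (Hz) equally spaced on the ERB-rate scale."""
    lo, hi = hz_to_erb_rate(low), hz_to_erb_rate(high)
    return [erb_rate_to_hz(lo + i * (hi - lo) / (n - 1)) for i in range(n)]


def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    """Sampled t^(order-1) * exp(-2*pi*b*ERB(fc)*t) * cos(2*pi*fc*t),
    scaled to unit magnitude response at fc (duration is an assumption)."""
    taps = int(round(duration * fs))
    decay = 2.0 * math.pi * b * erb(fc)
    ir = [(k / fs) ** (order - 1) * math.exp(-decay * k / fs)
          * math.cos(2.0 * math.pi * fc * k / fs) for k in range(taps)]
    # |DTFT| of the impulse response at fc, used as the normalisation gain.
    re = sum(h * math.cos(2.0 * math.pi * fc * k / fs) for k, h in enumerate(ir))
    im = sum(h * math.sin(2.0 * math.pi * fc * k / fs) for k, h in enumerate(ir))
    gain = math.hypot(re, im)
    return [h / gain for h in ir]


def fir_filter(h, x):
    """Causal FIR filtering: y[i] = sum_k h[k] * x[i - k]."""
    y = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(min(len(h), i + 1)):
            acc += h[k] * x[i - k]
        y.append(acc)
    return y


def gammatone_spectrogram(x, fs, n_channels=8, low=200.0, high=3000.0,
                          frame=0.025, hop=0.010):
    """Log mean power per frame and channel: a frames x channels matrix."""
    centres = erb_space(low, high, n_channels)
    frame_len, hop_len = int(round(frame * fs)), int(round(hop * fs))
    n_frames = 1 + (len(x) - frame_len) // hop_len
    spec = [[0.0] * n_channels for _ in range(n_frames)]
    for c, fc in enumerate(centres):
        y = fir_filter(gammatone_ir(fc, fs), x)
        for m in range(n_frames):
            seg = y[m * hop_len: m * hop_len + frame_len]
            # Mean power in the frame, log-compressed; 1e-10 avoids log(0).
            spec[m][c] = math.log10(sum(v * v for v in seg) / frame_len + 1e-10)
    return centres, spec


if __name__ == "__main__":
    fs = 8000
    tone = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(1000)]  # 1 kHz, 125 ms
    centres, spec = gammatone_spectrogram(tone, fs)
    last = spec[-1]
    print("centre frequencies (Hz):", [round(f) for f in centres])
    print("strongest channel (Hz):", round(centres[last.index(max(last))]))
```

For a pure 1 kHz tone, the channel whose centre frequency lies closest to 1 kHz should carry the most energy once the filters have settled. Direct FIR convolution is used here only for readability; it is slow for real recordings, where a vectorised or IIR filterbank implementation would be preferable.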